Indirect Instruction Injection

Updated 22 May 2026

Indirect instruction injection is a technique where adversaries embed malicious commands into external data—such as documents, tool outputs, or multimedia—to hijack LLM behavior without altering core prompts.
Attack vectors include tool output injection, retrieval corpus poisoning, GUI/environmental manipulation, and multimodal payloads that exploit the model’s inability to separate trusted data from actionable instructions.
Defensive strategies involve prompt-based heuristics, LLM-driven instruction detection, behavioral classifiers, and strict architectural constraints to effectively lower attack success rates.

Indirect instruction injection, widely termed “indirect prompt injection” (IPI) in the literature, denotes a class of attacks targeting LLMs and multimodal agents by embedding adversarial instructions into external data sources—such as retrieved documents, tool outputs, GUI elements, or multimodal inputs like images and audio. Unlike direct prompt injection (DPI), where the adversary tampers with the core prompt seen by the model (e.g., as the direct user), IPI exploits the model’s inability to reliably distinguish between core instructions and “pure data” received during its operation. Such attacks enable adversaries to hijack downstream decision-making, induce malicious actions, or subvert user intent, all without access to the internal prompt-building process or privileged system inputs. This article surveys the definitions, mechanisms, empirical risk, methodologies, and defenses characterizing the contemporary state of indirect instruction injection research.

1. Formal Definition and Threat Modeling

IPI is characterized by adversarial instructions delivered via external or untrusted input channels. In tool- and retrieval-augmented LLM architectures, the core pipeline can be summarized as follows:

Let $u$ be a benign user instruction, and $D$ a set of data sources or tool outputs. The model’s effective input is $x = \Gamma(u, D)$ , where $\Gamma$ is the context-construction operation (including tool observations, retrieved passages, file contents, chat messages, GUI screenshots, etc.).
The adversary cannot tamper with $u$ or the immutable system prompt, but can inject a malicious payload $x_{\mathrm{inj}} \in D$ by planting instructions in $D$ that, when incorporated via $\Gamma$ , alter model behavior.
Success is defined as the model’s output, or a tool/action call, satisfying the attacker’s objective as induced by $x_{\mathrm{inj}}$ , despite $u$ never requesting it (Lu et al., 20 May 2025, Xie et al., 27 Oct 2025, Chang et al., 11 Jan 2026, Jia et al., 2024, Bagdasaryan et al., 2023, Zhao et al., 13 Apr 2026, Chang et al., 26 Sep 2025, Zhan et al., 2024).

Threat models vary by channel:

Black-box (no model weights/internals visible): typical for injected webpage content, emails, code repositories, GUIs (Lu et al., 20 May 2025, Zhao et al., 13 Apr 2026, Zhao et al., 18 May 2026).
White-box (prompt or architecture leaked): e.g., tool definition poisoning in coding IDEs or where the internal prompt is obtained (Xie et al., 27 Oct 2025).

Attack success rate (ASR) quantifies the fraction of tasks where the model follows the injected instruction rather than the user's intended one. Some variant ASR definitions stratify cases by surface (email, file, chat), attack method, or whether the action required is “direct harm,” “data exfiltration,” or “goal hijack” (Zhan et al., 2024, Zhao et al., 18 May 2026, Chen et al., 18 Jul 2025).

2. Attack Realizations and Methodological Variants

IPI attacks span distinct vectors:

Tool Output Injection: Malicious instructions are injected into responses from external tools, web APIs, database queries, or local file reads. These are treated as trusted context and typically concatenated into the model’s prompt unvetted (Zhao et al., 13 Apr 2026, An et al., 21 Aug 2025, Zhan et al., 2024, Xie et al., 27 Oct 2025).
Retrieval Corpus Poisoning: Attacker plants payload in external index, website, or document corpus. Upon a natural query, retrieval surfaces the poisoned fragment to the LLM, which then executes the embedded instruction (Chang et al., 11 Jan 2026, Chen et al., 18 Jul 2025, Chen et al., 23 Feb 2025).
GUI/Environmental Injection: Visual overlays—popups, chat windows, fake buttons—encode adversarial cues in the agent’s observation stream (screenshot or rendered DOM), altering sequential action selection (Lu et al., 20 May 2025).
Tool/Skill Description Hijack: Adversary stealthily modifies tool/skill descriptions so that the LLM agent, upon loading or dynamically reasoning, executes arbitrary behavior (not user-requested), sometimes in a query-agnostic fashion (Xie et al., 27 Oct 2025).
Multimodal Payload Injection: Instructions hidden via adversarial perturbations of images, audio, or video, so that when presented in a text+media pipeline the LLM is implicitly primed to produce attacker-chosen outputs (Bagdasaryan et al., 2023, Lu et al., 5 Dec 2025).
Chat Template Exploitation: Attacks mimic native chat role-tokens or conversational scaffolding, exploiting model-specific segmentation or authority cues. Persuasion-driven “multi-turn” variants are particularly transferable and robust (Chang et al., 26 Sep 2025).
Topic-Transition Smoothing: Instead of abrupt injected commands, the adversary engineers a fabricated dialogue or transition that massages model attention and semantic coherence toward the adversarial payload (Chen et al., 18 Jul 2025).

3. Empirical Risk, Benchmarks, and Attack Characterization

IPIs are systematically benchmarked using multi-channel, multi-goal testbeds:

InjecAgent (Zhan et al., 2024): 1,054 test cases over 17 user and 62 attacker tools, partitioned into “direct harm” and “data-stealing” attacks. ASRs on ReAct-prompted GPT-4 reach ~24%; Llama2-70B up to ~76%.
LivePI (Zhao et al., 18 May 2026): 169 attack cases across 7 input surfaces (group chat, email, file, web, gist, repo) and 12 attack families; five SOTA LLMs exhibit ASRs from 10.7% (Claude 4.6) to 29.6% (Gemini 3.1), with group chat and code-repo links achieving 100% success on most models.
AgentDojo (An et al., 21 Aug 2025, Jia et al., 2024): Multi-domain agent tasks; under IPI attack, tool-call and task utility degradation metrics reveal vulnerability even after basic prompt-based defenses.
Query-specific vs. Query-agnostic (Xie et al., 27 Oct 2025): Query-agnostic tool description poisoning yields deterministic, high-transferability attacks (ASR up to 87%), unlike the probabilistic/unstable opportunistic case.

Attack effectiveness is magnified by methods which: (a) exploit attention allocation (EVA: 80% ASR in pop-up attacks vs. 48% for static), (b) synthesize query-agnostic payloads that trigger on all user queries, (c) blend injected instructions through topic or conversational shifts—achieving >90% ASR under TopicAttack even against robustly trained/fine-tuned models (Lu et al., 20 May 2025, Chen et al., 18 Jul 2025, Xie et al., 27 Oct 2025).

4. Detection, Removal, and Structural Defenses

Detection and mitigation strategies fall into several classes:

A. Prompt/Text-based Heuristics:

Use delimiters, sandwiching, or prompt repetition (e.g., user instruction after every tool output) to reconnect model focus to user intent. These generally drop ASR only to 5–10% and are bypassed by template or multi-turn attacks (Chang et al., 26 Sep 2025, Chen et al., 23 Feb 2025).

B. LLM-based Instruction Detection:

Prompt the LLM to explicitly enumerate actionable instructions in context (IntentGuard) and block those originating from untrusted segments. Achieves low ASR (<5%), even under adaptive attacks, with minimal benign utility loss (Kang et al., 30 Nov 2025).

C. Behavioral and State-based Classifiers:

Detect “footprints” of malicious instructions via intermediate layer hidden states and gradients (spectral or behavioral signatures); SOTA classifiers yield detection rates ≳99% and can drive ASR to ~0.1% in out-of-domain attacks (Wen et al., 8 May 2025, Chen et al., 23 Feb 2025).
Feature attribution from both forward and backward passes enables highly discriminative defenses at the cost of computational overhead (Wen et al., 8 May 2025).

D. Architectural or Planning Constraints:

Enforce tool-call policy boundaries via deterministic runtime gating (ClawGuard) or strictly pre-planned tool dependency graphs (IPIGuard). These defenses prohibit unauthorized write actions regardless of LLM alignment and prevent adversarial cascades (Zhao et al., 13 Apr 2026, An et al., 21 Aug 2025).
Fine-grained policy rules (path, command, cost) enable zero ASR on all LivePI cases and preserve >99% benign task utility (Zhao et al., 18 May 2026).

E. Parsing/Structural Filtering:

Parse and extract only strictly formatted fields from tool outputs (ToolResultParsing), dropping all non-conforming text. Removes embedded instructions before prompt-construction; achieves 0.1–0.2% ASR while preserving task completion (Yu et al., 8 Jan 2026).
Neural-based attribution pruning: Identify and mask a minimal subset of KV-cache neurons (“CachePrune”) that are responsible for transitioning “data” into “instruction” interpretation. Offers strong reductions in ASR (to <1%) without format constraints or prompt engineering (Wang et al., 29 Apr 2025).

F. Multimodal and Visual Defenses:

Leverage visual attention monitoring and interpretability to identify overlays or GUI elements that draw disproportionate model saliency (EVA) (Lu et al., 20 May 2025).
For image/audio-based payloads, defensive activation steering in the model’s hidden representation space (ARGUS) quarantines attacker-induced subspace directions, preserving user-instruction accuracy while suppressing adversarial channels (Lu et al., 5 Dec 2025).

5. Key Insights, Attack Propagation, and Transferability

Several properties emerge from the cross-study synthesis:

Attention and Contextualization: High attack success is strongly correlated with the model’s attention mass over injected regions, both spatially (in GUI/vision agents) and sequentially (in RAG or chat agents) (Lu et al., 20 May 2025, Chen et al., 18 Jul 2025, Lu et al., 5 Dec 2025).
Instruction Following Bias: Modern LLMs exhibit a fundamental inability to demarcate data from actionable instructions when delivered outside protected prompt slots. This property is model-agnostic and not addressed by instruction finetuning alone (Wang et al., 29 Apr 2025, Chen et al., 23 Feb 2025, Chang et al., 11 Jan 2026).
Cross-model Vulnerability and Template Transfer: Attacks leveraging chat-template tokens or query-agnostic tool descriptions transfer with high efficacy to unseen backbones (up to +46% ASR delta in some cases), and remain resilient under moderate template perturbations or masking-based prompt defenses (Chang et al., 26 Sep 2025, Xie et al., 27 Oct 2025).
Reasoning Paradox: Reasoning-enabled or chain-of-thought models are a double-edged sword—more effective at both sophisticated, stealthy execution and sometimes at leaking traces of adversarial logic (“meta-cognitive leakage”), increasing human detectability in complex scenarios (Wirth, 19 Feb 2026).

6. Limitations of Defenses and Open Research Challenges

Despite substantial advances, notable limitations and frontiers remain:

Adaptive and Shifted Attacks: Current classifiers and prompt-based countermeasures are challenged by smoothly-blended instructions (e.g., topic-drift or transition methods), multimodal images/audio, or query-agnostic payloads with extremely low perplexity or instruction-like entropy (Chen et al., 18 Jul 2025, Lu et al., 5 Dec 2025, Bagdasaryan et al., 2023).
False Positives/Utility Loss: Overly aggressive, context-agnostic filtering can degrade benign utility, especially when out-of-domain or “clean” documents are misclassified as malicious. Maintaining a strict low FPR without loss of task success is an open analytical concern (Kang et al., 30 Nov 2025, Chen et al., 23 Feb 2025).
Scalability and Overhead: Gradient-based or neural attribution techniques may incur nontrivial computational latency, restricting deployment to offline or batch-processing contexts (Wen et al., 8 May 2025, Wang et al., 29 Apr 2025).
Semantic Parsing and Planning Requirements: Structural defenses (TDG, policy enforcement) depend on correct up-front task decompositions or user scope declarations—a challenge in open-ended or underspecified workflows, and may be incomplete for attacks not involving explicit tool calls (An et al., 21 Aug 2025, Zhao et al., 13 Apr 2026).
Multimodal and Multi-agent Extensions: Detection and filtering in image, audio, or mixed-modal settings is underexplored, as is coordinated defense for complex agent swarms or distributed workflows (Lu et al., 5 Dec 2025, Lu et al., 20 May 2025, Bagdasaryan et al., 2023, Chang et al., 11 Jan 2026).

Continued development is required in interpretable attention regularization, intention-tracing, multi-surface robustness evaluation (as exemplified by LivePI), principled input provenance control, and black-box validation pipelines for emerging deployment architectures.

7. Outlook and Research Directions

Research continues to broaden both adversarial characterization and principled defenses for indirect instruction injection:

Principled Provenance Tracking: Mapping prompt segment origin to trust/scoping labels and enforcing end-to-end channel separation, formalized in frameworks like IntentGuard and emerging crypto-signature schemes (Kang et al., 30 Nov 2025).
Active, Goal-conditioned Gating: Incorporating runtime test-time alignment and dynamic policy constraints at tool boundaries (ClawGuard, IPIGuard) to provably block unauthorized actions, even under unknown or adaptive attack surfaces (Zhao et al., 13 Apr 2026, An et al., 21 Aug 2025, Zhao et al., 18 May 2026).
Latent Behavior Steering: Modulating model internal representations and attention allocation dynamically via activation-space steering or surgical mitigation (ICON, ARGUS, CachePrune) offers universal, backbone-agnostic routes to maintaining utility under attack (Wang et al., 24 Feb 2026, Lu et al., 5 Dec 2025, Wang et al., 29 Apr 2025).
Benchmarks and Open Evaluation: The establishment of live, heterogenous testbeds (LivePI), rich multi-domain benchmarks (AgentDojo, InjecAgent), and open-source replication toolchains accelerates the empirical and theoretical refinement of attack and defense methodologies (Zhao et al., 18 May 2026, Zhan et al., 2024, An et al., 21 Aug 2025).

A plausible implication is that ongoing evolution of both adversarial and defensive sophistication will increasingly demand hybrid, layered security architectures blending detection, structural gating, provenance analysis, and interpretable reasoning introspection. Further progress will likely depend on cross-modal robustness, automated defense generalization, and the integration of human oversight for high-stakes applications.