Papers
Topics
Authors
Recent
Search
2000 character limit reached

Plan Injection Attacks in LLM Systems

Updated 25 December 2025
  • Plan injection attacks are methods that exploit untrusted input channels to embed adversarial instructions into large language model applications.
  • They leverage the blurred boundary between trusted developer prompts and user data, achieving up to 96–100% success rates in undefended systems.
  • Mitigation strategies like PromptLocate and StruQ use semantic segmentation and structured queries to localize and prevent injected plans, restoring system integrity.

A plan injection attack, also known as prompt injection, is a method by which an adversary manipulates the input to a LLM-integrated application such that the model deviates from the original developer’s intended instructions and instead executes tasks specified by the attacker. This method leverages the lack of internal distinction in LLM architectures between trusted instructions and untrusted user data, enabling the attacker to embed new actionable instructions (“plans”) within input data streams. As LLM-integrated applications have become ubiquitous, plan injection attacks represent a critical category of threats, prompting both formalization of the attack surface and active research into detection and mitigation (Chen et al., 2024, Jia et al., 14 Oct 2025).

1. Formalization and Taxonomy of Plan Injection Attacks

A plan injection attack on an LLM can be rigorously characterized by the contamination of the target data channel with an injected prompt consisting of attacker-chosen instructions and data. Let ff be an LLM that is intended to complete a "target task" specified by instruction sts_t and data xtx_t, such that the target prompt is pt=stxtp_t = s_t \,\|\, x_t and f(pt)ytf(p_t) \simeq y_t, with \simeq denoting semantic equivalence and \| denoting string concatenation.

The attacker seeks execution of an injected task via an injected instruction ses_e and injected data xex_e, forming pe=sexep_e = s_e \,\|\, x_e and f(pe)yef(p_e) \simeq y_e. The attack proceeds by embedding pep_e into xtx_t, producing contaminated data xcx_c, and the contaminated prompt pc=stxcp_c = s_t \,\|\, x_c. The attack is successful if f(pc)yef(p_c) \simeq y_e, i.e., the model’s response matches the attacker's target output (Jia et al., 14 Oct 2025).

There are two central components to a plan injection attack:

  • Injected instruction (ses_e): An adversarial directive intended to override or supersede sts_t.
  • Injected data (xex_e): Task context, content, or arguments tailored for the attacker’s objective.

Plan injection attacks exploit the absence of a strict boundary between prompts and user data in conventional LLM APIs, enabling both naïve (direct appended instructions) and sophisticated (completion mimicry, control character insertion, fake prompt markers) attack variants.

2. Attack Models, Threat Capabilities, and Observed Impacts

The canonical threat model assumes the attacker controls only xtx_t (the untrusted data channel), not the trusted prompt sts_t nor any surrounding orchestrator code (Chen et al., 2024). Attackers are further presumed to know filter schemas, input formats, and reserved tokens but are constrained in their ability to introduce reserved delimiters if those are appropriately filtered at the interface level.

Empirical evaluations reveal that, for two widely used 7B-parameter LLMs (Alpaca and Mistral), undefended systems are fully vulnerable:

This magnitude of vulnerability illustrates the endemic nature of plan injection in “one-big-text” LLM APIs and motivates preventative architectures and forensic detection frameworks.

3. Detection and Localization Strategies

Prompt injection localization is essential for forensic analysis and post-attack data recovery. PromptLocate provides a formal method for pinpointing injected instructions and data within contaminated model inputs (Jia et al., 14 Oct 2025).

PromptLocate operationalizes a three-stage process:

  1. Semantic Segmentation: The contaminated input xcx_c is partitioned into semantic segments. This is achieved by tokenizing xcx_c, embedding each word with an LLM’s embedding layer, and inserting segment boundaries where consecutive word embeddings’ cosine similarity falls below a threshold τ\tau, selected by grid search to optimize ground-truth matching.
  2. Instruction Detection: Using a fine-tuned detection LLM (DataSentinel), the system classifies each segment as clean or contaminated via group-based binary search, iteratively refining the set of instruction-contaminated segments II.
  3. Data Injection Pinpointing: Defining a Contextual Inconsistency Score (CIS), the framework searches between adjacent instruction-injected segments to flag segments where contextual probabilities indicate injected data, subject to oracle validation.

PromptLocate achieves high localization accuracy across a broad spectrum of attack strategies, including eight existing and eight adaptive attack types. On the OpenPromptInjection benchmark, ROUGE-L F1 (RL) of 0.93–0.99, embedding similarity (ES) of 0.94–0.99, and token precision/recall of 0.95–1.00 are seen across all attacks. Post-removal of identified injections, Attack Success Value (ASV) is reduced to below 0.09 (Jia et al., 14 Oct 2025).

4. Preventative Controls: Structured Queries and Interface Hardening

StruQ (Structured Queries) presents a practical defense, re-architecting the LLM interface from “one-big-text” to two-channel structured queries. The front end enforces a format:

Q=[MARK][INST][COLN]π[MARK][INPT][COLN]d[MARK][RESP][COLN]Q = [\text{MARK}]\, [\text{INST}][\text{COLN}]\, \pi\,\|\, [\text{MARK}]\, [\text{INPT}][\text{COLN}]\, d\,\|\, [\text{MARK}]\, [\text{RESP}][\text{COLN}]

where

  • π\pi is the trusted developer instruction,
  • dd is untrusted user data,
  • [MARK], [INST], [INPT], [RESP], [COLN] are reserved tokens,
  • the front end actively strips occurrences of reserved tokens from dd, ensuring the adversary cannot spoof query delimiters,
  • these tokens map to new token IDs with transferred and refined embeddings.

The LLM backend is transitioned to a structured instruction-tuned variant via fine-tuning on a mixture of clean and adversarially contaminated examples:

  • Clean (original) (pi,di,ri)(p_i,d_i,r_i) triplets,
  • Naïve-injection variants (instruction inserted into did_i),
  • Completion-Other variants (fake responses and delimiters with a bogus instruction inside did_i).

This regime implants separation: the LLM learns to attend to instructions before [INPT] and ignore everything after.

In empirical studies, StruQ reduces human-crafted attack success rates to below 2% and TAP-style attack success from 97–100% to 9–36%, with statistical parity in utility (AlpacaEval 67.2–67.6% and 80.0–78.7% on Alpaca and Mistral, respectively) (Chen et al., 2024).

5. Evaluation Metrics, Benchmarks, and Empirical Findings

Evaluation of plan injection defenses and localization strategies relies on purpose-built datasets and multi-faceted metrics:

  • OpenPromptInjection: 49 target/injected pairs over 7 tasks, spanning classification, grammar correction, and summarization, with bespoke and optimization-based attacks (Jia et al., 14 Oct 2025).
  • AgentDojo: LLM agent environments with tool-call outputs, backend GPT-4o, and injected prompts.

Key metrics include:

  • Localization: ROUGE-L F1, Embedding Similarity (Sentence-BERT), token overlap precision and recall.
  • Attack efficacy: PNA-I (injected-task performance), ASV (attack success under contamination).
  • Recovery: PNA-T (performance under no attack), PA-T, PR (after removal and re-evaluation).

PromptLocate consistently outperforms attribution baselines, drives ASV-A (attack success post-defensive localization) to ≤0.09 across adaptive attacks, and restores post-removal utility to 85–100% PNA-T. StruQ achieves <2% human-attack success and >90% reduction in automated attack success rates without loss of core LLM capability (Chen et al., 2024, Jia et al., 14 Oct 2025).

6. Limitations, Challenges, and Future Directions

Several limitations contextualize the robustness and use-cases of plan injection defenses:

  • Input Sanitization Limitations: The soft boundary between “instructions” and “data” poses challenges not paralleled in non-LLM domains like SQL or XSS, potentially complicating generalized input filtering (Jia et al., 14 Oct 2025).
  • Oracle Dependency: The efficacy of localization (PromptLocate) is bounded by the oracle's capacity; attacks that bypass oracle detection will evade downstream localization. This suggests further detector tailoring and advanced LLM-based classifiers are required.
  • Adaptive Attacks: If attackers manipulate only the data channel to induce a same-type task, rather than explicit instruction injection, localization is as difficult as adversarial example or misinformation detection.
  • Pre-trained Instruction Adherence: Structured instruction tuning is reliant on untrained (non-instruction-following) base models. If a model is already instruction-tuned to follow any directive, retraining may not sufficiently degrade the “follow all instructions” habit (Chen et al., 2024).
  • Recovery-Utility Tradeoff: Imperfect recovery may leave residual injected tokens, necessitating a choice between data rejection (denial-of-service) or tolerating small risk in recovered data.
  • Performance Considerations: PromptLocate’s runtime is ~11 seconds per sample on an NVIDIA RTX A5000 GPU, scaling sub-linearly with input length due to binary search in the contaminated segment search step.

Potential future work includes provable guarantees under compositional attack strategies, integrating localization with upstream prevention, and scaling to broader LLM architectures and application domains (Jia et al., 14 Oct 2025).


Table: Selected Plan Injection Attacks, Defenses, and Empirical Outcomes

Approach Attack/Defense Description Empirical Result
Completion Attack Naïvely or imitating delimiters; appends instructions in data Success 96–100% on undefended models (Chen et al., 2024)
Tree-Of-Attacks (TAP) Iterative, black-box feedback-driven attack 97–100% success on baseline, 9–36% after StruQ (Chen et al., 2024)
StruQ Structured query two-channel interface + filtering + instruction-tuned LLM <2% human-attack success, no utility loss (Chen et al., 2024)
PromptLocate Three-step semantic localization and removal RL 0.93–0.99, ASV-A ≤0.09 across attacks (Jia et al., 14 Oct 2025)

In summary, plan injection attacks fundamentally challenge the security posture of LLM-integrated systems due to inherent interface and model design flaws. State-of-the-art mitigation via interface refactoring (StruQ) and post hoc forensic detection (PromptLocate) demonstrate substantial resistance to attack, but ongoing adaptation in attacker tactics and the blurred boundary between instruction and data spaces ensure the topic remains an active research frontier (Chen et al., 2024, Jia et al., 14 Oct 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (2)

Topic to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Plan Injection Attacks.