Prompt Injection Attacks
Last updated: June 13, 2025
Prompt injection attacks ° are a critical, rapidly advancing threat to the integrity and security of LLM °-integrated systems. Unlike attacks targeting model internals, prompt injection exploits ° the language interface itself: attackers manipulate application inputs to inject new instructions, hijack system behavior, or exfiltrate sensitive information. Over the last two years, research has moved from anecdotal exploit reports to systematic frameworks, quantitative benchmarks, and sophisticated defenses, reflecting both the prevalence and the evolving complexity of this class of attacks (Liu et al., 2023 ° , Rossi et al., 31 Jan 2024 ° ).
Background and Impact
The Nature and Scope of Prompt Injection
Prompt injection ° arises from LLMs °’ inability to reliably distinguish between trusted, intended instructions and potentially adversarial data within the application prompt. As LLMs are increasingly embedded in applications—ranging from chatbots, email clients, banking agents, to medical vision-LLMs (VLMs)—their attack surface ° extends to any context where attacker-provided content can become part of the model’s input (Alizadeh et al., 1 Jun 2025 ° , Clusmann et al., 23 Jul 2024 ° ).
The security risk is substantial:
- Goal hijacking: Attackers override the intended instruction, causing the model (and thus the application) to perform actions entirely under adversarial control.
- Data leakage: Sensitive information protected by system-level instructions (such as secrets or user PII) can be extracted (Khomsky et al., 20 Jun 2024 ° , Alizadeh et al., 1 Jun 2025 ° ).
- Erosion of system trust: Incidents in finance and healthcare demonstrate that prompt injection can compromise safety and privacy in high-stakes environments, such as leaking personal account data or causing radiology AIs to misclassify cancer lesions (Alizadeh et al., 1 Jun 2025 ° , Clusmann et al., 23 Jul 2024 ° ).
Formalization and Taxonomy
Recent research has proposed precise frameworks and principled taxonomies to structure this field:
Formal Attack Framework
A prompt injection attack is a manipulation where the adversary-controlled data prompt, , is transformed to a contaminated prompt , so that when combined with the system’s instruction, the LLM performs the attacker-desired task instead of the intended one. This is formalized as: where specifies the attack transformation strategy, and , denote the injected instruction and data (Liu et al., 2023 ° ). The model is then queried with , where, ideally, the output matches what would be produced under .
Attack Modes and Classes
A consensus taxonomy distinguishes attacks by mode (who controls the input channel) and mechanism (Liu et al., 2023 ° , Rossi et al., 31 Jan 2024 ° ):
Modes:
- Direct: Attackers submit prompts directly to the LLM (e.g., chatbots, web forms).
- Indirect: Adversarial prompts are injected via third-party sources (emails, web content, plugin-fetched data) and integrated by unaware applications.
Classes of Mechanism:
- Naive concatenation: Append the injected instruction to benign input (Liu et al., 2023 ° ).
- Escape characters: Insert formatting or control characters (e.g., newlines, tabs) to break context or delimit instruction boundaries.
- Context ignoring: Use phrases like “ignore previous instruction” to explicitly override system prompts ° or earlier text.
- Fake completion attacks: Supply a simulated response to trick the model into treating the original task as finished, then follow with the intended malicious instruction.
- Combined attacks: Compose multiple above mechanisms for increased success and transferability. The “Combined Attack”—combining escape, fake completion, and context ignoring—has been shown to outperform all prior handcrafted strategies (Liu et al., 2023 ° ).
- Adversarial suffixes, obfuscation, payload splitting: Algorithmic search or coding tricks generate suffixes or multi-part inputs that bypass filters and defenses (Rossi et al., 31 Jan 2024 ° , Pasquini et al., 6 Mar 2024 ° ).
Indirect attacks further include active (malicious prompts submitted via channels like email to LLM-augmented agents), passive (poisoning documents, web pages), user-driven (social engineering prompts for copy-paste), and training data poisoning/backdoors (Rossi et al., 31 Jan 2024 ° , Shao et al., 18 Oct 2024 ° ).
Evolution of Attack Methodology
The landscape has shifted from static, human-engineered strategies to automated and optimization-based methods:
- Automated variant generation: Tools like Maatphor iteratively generate and test variants of prompt injections, revealing that slight stylistic or contextual changes can evade current defenses (Salem et al., 2023 ° ).
- Neural (optimization-based) attacks: Neural Exec, for example, uses gradient-driven search over token space ° to find execution triggers, producing statistically diverse, highly effective attacks that persist through preprocessing (e.g., in retrieval-augmented generation pipelines) and evade pattern-based or blacklist-based defenses (Pasquini et al., 6 Mar 2024 ° ).
Defenses: State of the Art and Limitations
Prevention-Based Defenses
- Prompt Engineering: Manual reminders, delimiters, paraphrasing, or “sandwich” defenses (placing safety prompts at the end) offer mild benefits but are consistently overcome by adaptive or combined attacks (Liu et al., 2023 ° , Jia et al., 23 May 2025 ° ).
- Encoding-Based Isolation: The Base64 ° defense encodes external data, but can degrade utility on complex or multilingual tasks °. Recent work demonstrates that aggregating multiple encodings (e.g., a “mixture of encodings”) can reduce attack success to near-zero while preserving output quality ° (Zhang et al., 10 Apr 2025 ° ).
- Signed-Prompt Approaches: Only instructions cryptographically or semantically “signed” by an authorized party are recognized by the model (via prompt engineering or fine-tuning). This approach effectively nullifies classical prompt injections, even for multilingual or paraphrased ° attacks (Suo, 15 Jan 2024 ° ).
Detection-Based Defenses
- Proactive Detection (Secret Instructions): Include a “secret” in every system prompt; downstream systems or classifiers verify if the output preserves the secret. Extremely effective against standard and some adaptive attacks (Liu et al., 2023 ° , Liu et al., 15 Apr 2025 ° ).
- Embedding-Based ° Classifiers: Use prompt embeddings and classical classifiers (Random Forest, XGBoost) to discern adversarially crafted prompts; outperform existing transformer-based open-source models in precision and recall ° (Ayub et al., 29 Oct 2024 ° ).
- Attention °-Pattern Detectors: Attention Tracker ° exploits “distraction effects,” detecting cases where model attention ° shifts from original to injected instructions—achieving strong, generalizable results across architectures (Hung et al., 1 Nov 2024 ° ).
- Game-Theoretic Detection: DataSentinel adversarially fine-tunes a detector on evolving, optimization-based prompt injections, attaining low false positive and negative rates even on adaptive attacks (Liu et al., 15 Apr 2025 ° ).
- Unified Masking Approaches: UniGuardian estimates trigger-word impact by masking prompt subsets, enabling real-time, model-agnostic detection—even against subtle, instruction-like triggers (Lin et al., 18 Feb 2025 ° ).
Architectural and Training Defenses
- Structured Queries: StruQ enforces strict API-level separation between instruction and user data (using reserved tokens and input filtering), combined with fine-tuned models to ignore misplaced instructions. On benchmarked attacks, StruQ reduced attack rates to below 2% with negligible utility loss (Chen et al., 9 Feb 2024 ° ).
- Alignment-Based Mitigation and Poisoning: Both a risk and solution: careful curation of alignment data can improve robustness, but “alignment poisoning” (injecting adversarial instruction pairs into training) dramatically increases LLM susceptibility to runtime prompt injection without degrading standard performance (Shao et al., 18 Oct 2024 ° ).
Harnessing Model Behavior
Rather than suppressing the model’s inherent instruction-following tendency, referencing-based defenses instruct the LLM to process and label each recognized instruction, filtering results post hoc based on only authorized instruction tags °. This near-eliminates attack success, remains effective for new attacks or multi-instruction prompts, and maintains accuracy (Chen et al., 29 Apr 2025 ° ).
Over-defense and Real-world Constraints
Detection models often suffer over-defense: excessive false positives triggered by benign prompts that contain popular attack “trigger words,” reducing system usability °. InjecGuard ° addresses this by data-centric retraining, balancing detection power ° and access, and setting a new accuracy baseline on NotInject ° and public challenge benchmarks (Li et al., 30 Oct 2024 ° ).
Effectiveness and Evaluation: Key Insights
- Comprehensive and Adaptive Benchmarks: Robust evaluation demands not just “canned” attacks but also optimization-based, adaptive, and cross-domain scenarios, ideally covering many LLMs and target/injected prompt pairs (Liu et al., 2023 ° , Jia et al., 23 May 2025 ° ).
- No Universal Panacea: All current defenses can be circumvented by adaptive, gradient-driven attacks, especially when not evaluated across diverse prompt/task pairs (Jia et al., 23 May 2025 ° ).
- Defenses Must Retain Utility: Some strong defenses (e.g., pure Base64) significantly harm LLM task performance. Effective defenses like the mixture of encodings or referencing approach retain high output quality (Zhang et al., 10 Apr 2025 ° , Chen et al., 29 Apr 2025 ° ).
- Safety Alignment is Powerful but Fragile: Modern LLMs’ safety tuning helps avoid the most egregious leaks (e.g., passwords), but blended ° attacks or requests for “all my account details and password” still succeed with surprising frequency, especially in less well-aligned or mid-tier models (Alizadeh et al., 1 Jun 2025 ° , Wang et al., 20 May 2025 ° ).
Real-World Implications and Challenges
- Agentic and Multimodal Scenarios: Prompt injection becomes even more insidious when coupled with tool-calling ° agents, multistep workflows, or vision-LLMs. Stealth attacks ° (e.g., visually hidden prompts in medical images) can have serious consequences even without model access (Alizadeh et al., 1 Jun 2025 ° , Clusmann et al., 23 Jul 2024 ° ).
- Defensive Fragility and Arms ° Race: As new defenses appear, attackers rapidly engineer bypasses (combining obfuscation, splitting, multilingual variations, and optimization-based triggers) (Pasquini et al., 6 Mar 2024 ° , Salem et al., 2023 ° , Jia et al., 23 May 2025 ° ).
- Utility vs. Security Trade-off: Overly defensive postures erode LLM utility and user trust; under-defense exposes users, data, or system function (Li et al., 30 Oct 2024 ° , Alizadeh et al., 1 Jun 2025 ° ).
Open Directions and Recommendations
- Standardized, Multidimensional Benchmarks: Defenders and model providers should adopt open, extensible evaluation suites ° (e.g., OpenPromptInjection, NotInject, MMLU-PI) and test routinely versus adaptive attacks (Liu et al., 2023 ° , Li et al., 30 Oct 2024 ° , Jia et al., 23 May 2025 ° ).
- Defense-in-Depth: Combine architecturally enforced isolation (e.g., structured queries, signed prompts), real-time detection, and robust alignment with ongoing red-teaming and variant analysis (Chen et al., 9 Feb 2024 ° , Suo, 15 Jan 2024 ° , Salem et al., 2023 ° ).
- Usable Security: Prioritize detection models and architectures (like InjecGuard, referencing-based filtering, or UniGuardian) that maintain high utility while minimizing both under- and over-defense error rates (Chen et al., 29 Apr 2025 ° , Li et al., 30 Oct 2024 ° , Lin et al., 18 Feb 2025 ° ).
- Transparency and Collaboration: Openness in code, benchmark data, and model evaluation ° is crucial for realistic threat assessment and community progress (Liu et al., 2023 ° , Jia et al., 23 May 2025 ° , Li et al., 30 Oct 2024 ° ).
Conclusion
Prompt injection attacks ° remain one of the most pressing security challenges for LLM applications. While the field has advanced in formalization, taxonomy, and defense, no single approach has achieved robust, universal mitigation—especially under adversarial, adaptive conditions °. Continued research must unite systematic benchmarks, compositional defenses, and usability-focused design to secure the future of LLM-powered ° systems in safety-critical and privacy-sensitive domains.