Papers
Topics
Authors
Recent
2000 character limit reached

Prompt Injection & Jailbreaking Attacks

Updated 29 December 2025
  • Prompt injection and jailbreaking attacks are adversarial techniques that manipulate input prompts to subvert large language model behavior.
  • They use methods like lexical camouflage, imperceptible Unicode tweaks, and structural template manipulation to bypass safety controls.
  • Recent research emphasizes defense strategies such as token sanitization, context isolation, and adversarial fine-tuning to effectively counter these threats.

Prompt injection and jailbreaking attacks constitute a fundamental class of adversarial threats to LLMs and related generative systems. These attacks manipulate model behavior at inference—often through cleverly crafted prompts or contextual cues—to elicit outputs that violate safety policies, disclose sensitive information, or override intended system functionality. This article surveys the core methodologies, taxonomies, empirical results, and defense paradigms established by recent research, with emphasis on precise terminology, formal threat models, and the state-of-the-art in both offensive and defensive mechanisms.

1. Definitions, Threat Taxonomy, and Conceptual Distinction

Prompt injection is a broad category of inference-time attack wherein the adversary crafts text fragments that, when concatenated to a user prompt or system context, subvert the model's intended operation. Jailbreaking is a specialized subclass of prompt injection that seeks to bypass safety alignment layers—typically inducing the model to produce harmful, policy-violating, or otherwise disallowed content (Wang et al., 20 May 2025, Gao et al., 6 Oct 2025).

Formally, for an LLM as a function f:XYf : X \rightarrow Y (maps input Unicode codepoints to output text), a prompt injection constructs x=xuxax = x_u \circ x_a, where xux_u is the user’s prompt and xax_a is the attacker’s injection. Jailbreaking is specifically the construction p=qsp = q \circ s, with qq a malicious question and ss an adversarial suffix, seeking f(p)f(p) that emits a harmful, normally filtered response (Gao et al., 6 Oct 2025).

A five-category taxonomy for prompt-level jailbreaks in both text and image modalities was established (Mustafa et al., 29 Jul 2025):

  • Multi-Turn Narrative Escalation: Safe and unsafe requests are interleaved to escalate harmful content over dialog turns.
  • Lexical Camouflage (Material Substitution): Disallowed tokens are replaced by benign synonyms/materials, evading lexical filters.
  • Implication Chaining: Harmful intent is distributed over several benign-seeming statements or questions, reconstructed by the model only when the chain is completed.
  • Fictional Impersonation: Unsafe instructions are framed as part of roleplay, creative writing, or hypothetical scenarios.
  • Semantic Edits/Euphemistic Reframing: Attacks are rewritten with educational or clinical language to slip past typical refusals.

Empirical evidence indicates that each moderation stage—input filtering, rewriting, classifier gating, post-generation validation—can be circumvented by one or more of these strategies.

2. Attack Techniques and Algorithmic Innovations

2.1 Direct and Indirect Prompt Attacks

Direct prompt injection places the adversarial content at the user-prompt interface. Indirect injection leverages third-party resources (e.g., HTML tags, PDF metadata, external file comments) that the LLM ingests during context construction. Multimodal variants include image-based injection, where adversarial instructions are hidden in submitted images, and data leakage attacks that prompt models to reveal system or training data (Yeo et al., 7 Sep 2025). Formally, attacks are marked successful if A(M([S,H,UI]))=1A(M([S, H, U \oplus I])) = 1 for predicate AA indicating adversary instruction compliance.

2.2 Imperceptible and Stealthy Jailbreaks

Imperceptible attacks include adversarial Unicode manipulations such as variation selectors—zero-width codepoints that do not change the visual representation but alter tokenization (Gao et al., 6 Oct 2025). The chain-of-search algorithm optimizes an invisible suffix S=[vsi1,...,vsiL]S = [vs_{i_1},...,vs_{i_L}] to maximize the log-probability of a target "acceptance" token given a harmful prompt, making injected suffixes undetectable by users and resilient to output-classifier filtering. Attack success rates (ASR) up to 100% were observed on Vicuna-13B, Mistral-7B, and Llama-2/3, with zero visible modifications.

2.3 Optimization- and Context-Driven Approaches

Techniques such as adversarial prompt distillation, genetic-programming-based prompt evolution, and reinforcement learning over candidate injection prompts have enabled efficient, black-box and white-box jailbreaks. Approaches like AGILE (Wang et al., 1 Aug 2025) exploit model hidden-state information and local gradient signals to guide minimal, semantics-preserving edits that maximize Attack Success Rate while maintaining transferability.

Large-scale ensemble pipelines (e.g., AutoJailbreak (Lu et al., 6 Jun 2024)) leverage dependency graphs to combine mutation, recombination, and scoring modules, synthesizing high-efficacy attacks by fusing genetic and adversarial-generation paradigms. Dialogue Injection Attacks exploit historical context manipulation to maximize harmful completion log-likelihoods, bypassing cross-turn alignment and legacy defenses (Meng et al., 11 Mar 2025).

2.4 Structural and Template Attacks

Recent work has revealed structural vulnerabilities exploitable by external prompt-template manipulation. SQL Injection Jailbreak (SIJ) attacks treat prompt templates analogously to SQL-injection, “commenting out” instruction boundaries and re-inserting adversarial control tokens to hijack model output, achieving ASR 100%\approx 100\% on open-source models, with minimal computation and high cross-model transferability (Zhao et al., 3 Nov 2024).

Special token injection—e.g., “virtual context” attacks that insert sequence boundary markers such as <<SEP>>—and context segmentation via repeated <<eos>> tokens can both fundamentally mislead the model’s segmentation of safe versus adversarial content (Zhou et al., 28 Jun 2024, Yu et al., 31 May 2024).

2.5 Multimodal and Cross-Modal Vectors

Adapter-based pipelines in text-to-image (T2I) diffusion models introduce hijacking attacks wherein benign image prompts are imperceptibly perturbed at the encoder feature-space level, conditioning the model to produce policy-violating NSFW content in response to user queries (Chen et al., 8 Apr 2025). Projected gradient descent in feature-space facilitates black-box, scalable, and user-blaming attacks even when text-based alignment defenses are robust.

3. Evaluation Methodologies and Empirical Findings

Attack success in this domain is measured by a variety of metrics:

Metric Definition
Attack Success Rate (ASR) Fraction of adversarial prompts for which the model yields the intended harmful output
Attack Success Probability (ASP) Fraction of outputs that fully comply with the malicious instruction, plus a weighted fraction of “uncertain” (ambiguous) outputs (Wang et al., 20 May 2025)
Harmfulness Score Judged by a large model (e.g. GPT-4), scale up to 5 or 10
Time Cost Per Sample Wall-clock time to construct an effective adversarial prompt

Recent benchmarks demonstrate that core attacks (e.g., ignore-prefix, hypnotism, imperceptible suffix) break all tested open-source LLMs (1–9B) with $60$–90%90\% ASP/ASR (Wang et al., 20 May 2025, Gao et al., 6 Oct 2025), and SIJ achieves 100%100\% ASR on multiple families (Zhao et al., 3 Nov 2024). Advanced black-box pipelines, such as activation-guided editing (Wang et al., 1 Aug 2025) and dialogue injection (Meng et al., 11 Mar 2025), show high efficacy against both open and closed-source models, bypassing at least five modern defense modules.

For image and cross-modal attacks, nudity (NSFW) rates of 70–90% can be induced in T2I systems using PGD-optimized image perturbations, which are indistinguishable to users and misclassified by system output filters (Chen et al., 8 Apr 2025).

4. Defensive Frameworks and Mitigation Strategies

Research presents a spectrum of defensive strategies, with layered, multi-modal, and context-aware pipelines emerging as necessary to counter modern attacks:

  • Input Normalization and Token Sanitization: Removal of zero-width, non-UTF8 characters; stripping, escaping, or re-tokenizing control and special tokens; regex-based cleaning of template boundaries and suspicious markers (Yeo et al., 7 Sep 2025, Rao et al., 22 Dec 2025, Zhou et al., 28 Jun 2024).
  • Semantic Feature and Linear SVM Filtering: Text normalization followed by TF-IDF vectorization and linear SVM enables cheap, interpretable, and high-specificity blockade of prompt-level and paraphrased attacks (Rao et al., 22 Dec 2025), outperforming deep neural and rule-only baselines.
  • Context Isolation and Cumulative Context Tracking: Explicit separation of untrusted context via fencing, subcontexts, and running context-aware safety classifiers across concatenated history (Mustafa et al., 29 Jul 2025).
  • Adversarially Robust Fine-Tuning and Adaptive System Prompts: Augmented safety fine-tuning with adversarial noise, randomization of prompt structure, and adaptive signature tokens disrupt structural (e.g., SIJ) attacks’ pattern-matching capabilities (Zhao et al., 3 Nov 2024).
  • Post-Generation Output Scanning: Two-pass validation with transformer-based or RoBERTa classifiers, or LLM-as-judge frameworks, to inspect output for policy violations unseen at the input level (Rao et al., 22 Dec 2025, Panebianco et al., 1 Aug 2025).
  • Semi-supervised, Static-Dynamic Hybrid Defenses: Embedding-based usage maps, density-based clustering, and anomaly scoring with SVMs/Forests, coupled with human-in-the-loop triage, yield high recall/precision against PII-leak and jailbreak attempts (Panebianco et al., 1 Aug 2025).
  • Adversarially Robust Modality Adapters: For T2I systems, adversarially trained CLIP encoders (e.g., FARE) are necessary within adapters to counter feature-space attacks that cannot be detected by downstream sample-space filters (Chen et al., 8 Apr 2025).

5. Open Problems, Robustness Limits, and Future Directions

Defensive robustness remains fundamentally limited by three properties of the attack surface:

  • Generalization and Stealth: Automated tools like Maatphor (Salem et al., 2023) and variant-generation pipelines can evolve new attack vectors and paraphrased variants that bypass both lexical and semantic pattern-matchers, unless proactively anticipated by defenders.
  • Cross-Modal and Structural Blind Spots: Modal and template-level attacks (image prompt adapters, SQL-injection-like prompt rewriters, virtual context) remain poorly defended in many production systems (Chen et al., 8 Apr 2025, Zhao et al., 3 Nov 2024, Zhou et al., 28 Jun 2024).
  • Latency, Scalability, and Cost: High-fidelity LLM-based judgment, deep transformer scans, and adversarially trained encoders introduce nontrivial computational overhead; efficient LSVMs and staged filtering are critical for real-time production deployment (Rao et al., 22 Dec 2025).

Future research will require comprehensive integration of token-level, semantic, and contextual defenses; adaptive system prompt design; dynamic thresholding with online feedback; integration of lightweight anomaly detectors for zero-day patterns; and cross-modal alignment pipelines that treat text, code, and visual context as mutually adversarial inputs.


Key references:

This synthesis represents the current landscape and salient challenges at the intersection of prompt injection, jailbreaking, and LLM security engineering.

Whiteboard

Topic to Video (Beta)

Follow Topic

Get notified by email when new papers are published related to Prompt Injection and Jailbreaking Attacks.