Dynamic Prompt Defense (DPD)
- Dynamic Prompt Defense (DPD) is a framework that dynamically modifies and optimizes prompts to reveal true intentions and counteract adversarial injections.
- It employs adaptive strategies such as prompt compression, attack inversion, reinforcement learning, and multi-agent pipelines to reduce attack success while preserving task performance.
- DPD extends to cross-modal applications like text-driven diffusion and promptable vision models, offering robust defense measured by metrics such as PSNR and mIoU recovery.
Dynamic Prompt Defense (DPD) encompasses a spectrum of defense mechanisms that proactively manipulate, compress, optimize, or adversarially alter prompts—often in a task- or attacker-adaptive fashion—to safeguard machine learning models from adversarial or malicious manipulation. While initially motivated by LLM vulnerabilities to prompt injection and jailbreak attacks, DPD also extends to cross-modal settings (e.g., image editing with text prompts) and semantic segmentation with promptable vision models. Recent research delineates diverse programmatic implementations of DPD, ranging from model-based prompt compressors for intention extraction, dynamic prompt rewriting via reinforcement learning (RL), adversarial prompt injection in offensive–defensive settings, to cyclic prompt perturbation strategies for robustness.
1. Core Principles and Taxonomy
DPD targets attacks wherein adversaries leverage the input prompt interface to induce models—LLMs or vision models controlled by prompts—to deviate from their intended safe or correct behavior. The unifying principle is to dynamically modify the prompt before, during, or after inference to either neutralize malicious payloads, reveal true user intent, or immunize against attack transferability.
There are several major DPD archetypes:
- Prompt Compression and Intention Exposure: Compressing the input to a minimal “intention” and exposing it to the model’s guardrails, e.g., SecurityLingua (Li et al., 15 Jun 2025).
- Multi-Agent Dynamic Filtering: Pipelines of coordinated agents that pre-filter, classify, and/or post-sanitize inputs/outputs, as in multi-agent defense pipelines (Hossain et al., 16 Sep 2025).
- Attack-Inversion Prompt Engineering: Inference-time prompt constructions that actively exploit and invert attack patterns—using “ignore” triggers, completion templates, or escape characters—to redirect models towards benign instructions (Chen et al., 1 Nov 2024).
- Adaptive Prompt Optimization via Online or RL-based Updates: RL-optimized dynamic rewriting of incoming prompts to continually suppress newly emerging attack sequences, including iterative jailbreaks (Kaneko et al., 19 Oct 2025).
- Adversarial Prompt Injection against Attackers: Defensive exploitation of LLM-agent vulnerabilities by planting sabotage or misdirection payloads (e.g., in decoy or honeypot systems), as implemented in Mantis (Pasquini et al., 28 Oct 2024).
- Cross-Modal Prompt Defenses in Vision: Cyclic adversarial optimization of text prompt embeddings to immunize image editing models against unseen, prompt-driven manipulations (Zhang et al., 16 Dec 2025).
- Adversarial Point Prompt Selection in Segmentation: Attack–defense game over sets of point prompts to “harden” vision models to prompt-based adversaries (Liu et al., 23 Sep 2025).
Each approach maintains a dynamic (input- and context-sensitive) element, distinguishing DPD from static, hard-coded filtering or single-shot prompt engineering.
2. Prompt Compression and Intention Extraction
SecurityLingua (Li et al., 15 Jun 2025) exemplifies DPD via security-aware prompt compression. The system consists of a pre-trained Transformer encoder with a token-level classification head. For a prompt of tokens , the model predicts, per token, a probability of being essential to the “true intention.” Tokens with are retained to form compressed intention . This compressed intention is injected into the system prompt:
- System prompt: discloses the true intention: “The user’s true intention is: ‘{c}.’ Please follow your safety guardrails and respond accordingly.”
- User prompt: passes the unchanged original prompt.
The compressor is supervised by cross-entropy loss over token labels (preserve/discard) and optionally intent classification loss on the [CLS] embedding. Experiments show negligible compute overhead (25 ms per query, 32 tokens added), compression ratio , and defense success rate ≈1% jailbreak (vs. 35% without defense, 4× better than the runner-up). Downstream task accuracy is maintained or improved.
This method renders the attacker's obfuscation moot by explicitly surfacing the underlying request to the LLM’s system guardrails while maintaining user experience.
3. Dynamic Prompt Engineering and Attack Inversion
Training-free, inference-time DPD strategies invert known attack vectors for defense. The methodology in "Defense Against Prompt Injection Attack by Leveraging Attack Techniques" (Chen et al., 1 Nov 2024) structures a composite prompt as:
where is the original instruction, is input data (which might be poisoned), is the attacker’s injection, and is a shield prompt mimicking attack patterns (e.g., “ignore previous,” escape sequences, fake completions, or multi-turn templates).
Shield prompt is selected dynamically to exploit the specific attack vector, steering the model to disregard the malice in and obey , with no retraining or fine-tuning. Empirically, DPD variants achieve near-zero attack success rates (ASR), outperforming all prior defenses both for direct and indirect prompt injections, and incur minimal latency overhead (~0.01–0.12 s/query). A notable observation is the linear correspondence between the strength of attack and defense—a “strong” attack, when inverted, yields a strong DPD variant.
4. Adaptive and Online DPD via Reinforcement Learning
Dynamic Prompt Defense has been formalized as an online Markov Decision Process where each incoming prompt is rewritten by a parameterized optimizer before dispatch to the target LLM (Kaneko et al., 19 Oct 2025). Actions are prompt rewrites, and rewards depend on the LLM’s output similarity to desired replies or refusals. Policy optimization is performed via REINFORCE, maximizing expected reward over batches of input prompts, with additional mechanisms to ensure robust adaptation:
- Past-Direction Gradient Damping (PDGD): Damps gradient updates in directions over-represented in previous adaptation steps to avoid overfitting to repeated attack vectors.
- Regularization and Replay: regularization prevents model drift, and periodic replay of past queries addresses catastrophic forgetting.
- Continuous Online Updates: The optimizer is updated dynamically in response to incoming attack variations, achieving sustained defense even against iterative, trial-and-error jailbreaks.
Reported results indicate ASR reduction by half versus the best prior prompt-only defenses, without negative impact—and in fact sometimes with improvement—on harmless prompt utility (as measured by perplexity and gold matching).
5. DPD in Multi-Agent and Real-Time Defense Architectures
"A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks" (Hossain et al., 16 Sep 2025) operationalizes DPD as coordinated agent pipelines, either in a chain-of-agents or hierarchical coordinator form:
- Chain-of-Agents: User input → Domain LLM → Guard Agent (output filtering, formatting, token blacklists, post-hoc audit).
- Coordinator Pipeline: Pre-input classifier filters or refuses malicious queries before invoking responsive LLMs, blocking attacks preemptively.
Guard and coordinator agents implement dynamic pattern matching, code fence and delegation detection, semantic matching via policy stores, and post-generation audit buffers. Comprehensive benchmarking (400 prompt-injection attacks across 8 vectors) demonstrates 100% ASR mitigation across ChatGLM and Llama2, covering attacks such as code execution, obfuscation, delegation manipulation, and multi-turn persistence.
However, performance overhead is a consideration: the chain-of-agents modality introduces one additional LLM inference, while the coordinator approach is more lightweight but may be susceptible to increased false positives with aggressive filtering.
6. Cross-Modal and Vision-Oriented DPD
DPD generalizes beyond LLMs. In text-driven diffusion editing, DPD (Zhang et al., 16 Dec 2025) is incorporated as an optimization loop over text prompt embeddings:
- For a given immunized image , DPD cyclically optimizes a text perturbation (under norm) to find the worst-case embedding variant, then updates via FlatGrad Defense Mechanism (FDM) for image-level robustness. The process alternates, forming a dual-loop adversarial cycle.
- Theoretical and empirical analysis confirm improved transferability: DPD-empowered perturbations immunize images not just for a single editing model but for a (prompt-) neighborhood, improving defense effectiveness on both intra- and cross-model malicious editing scenarios. Quantitative metrics such as PSNR, LPIPS, and FSIM show consistent cross-model gains.
Similarly, in prompt-based vision (e.g., point prompts for the Segment Anything Model), DPD (Liu et al., 23 Sep 2025) realizes a two-agent, adversarial RL setup. An attacker agent activates point prompts to maximally degrade segmentation, while a defender agent (trained adversarially) learns to deactivate and restore segmentation accuracy. After training, the defender rapidly refines user or system-provided prompt sets at inference, elevating robustness and generalization across datasets (e.g., mIoU recovery from 21.5% to 63.5% on PASCAL-VOC under adversarial attack).
7. Offensive DPD: Defending Against Malicious LLM-Agents
In the adversarial security domain, DPD has been deployed “offensively” to neutralize or misdirect automated LLM-powered cyberattackers. The Mantis framework (Pasquini et al., 28 Oct 2024) utilizes decoy services—FTP honeypots, web-apps—that dynamically inject tailored adversarial prompts (via ANSI, HTML comments, etc.) into attacker-accessible interfaces:
- Passive DPD: Entraps LLM-agents in infinite or exhausting action cycles (e.g., file-system traversal loops), rapidly consuming their compute budget.
- Active DPD: Injects command payloads (e.g. reverse shell instructions), potentially allowing defenders to compromise the attacking agent’s infrastructure.
Empirical evaluation using weaponized pentest agents and multiple backend LLMs shows DPD yielding >95% defender win rate with negligible prior knowledge of the attacker's model internals. Techniques such as randomization of trigger payloads and hiding via control codes increase stealth and robustness.
This class of DPD extends the notion of dynamic prompt manipulation to adversarial “honeytrap” engagements, leveraging the same attack surface for defense.
References
- SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression (Li et al., 15 Jun 2025)
- A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks (Hossain et al., 16 Sep 2025)
- Defense Against Prompt Injection Attack by Leveraging Attack Techniques (Chen et al., 1 Nov 2024)
- Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks (Pasquini et al., 28 Oct 2024)
- Towards Transferable Defense Against Malicious Image Edits (Zhang et al., 16 Dec 2025)
- Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model (Liu et al., 23 Sep 2025)
- Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization (Kaneko et al., 19 Oct 2025)