IPI-Centric Defense Frameworks
- IPI-centric defense frameworks are mechanisms designed to mitigate indirect prompt injection by securing the input, process, and intent phases of system operation.
- They employ diverse techniques—including detection, prompt engineering, fine-tuning, and structured system design (e.g., IPIGuard, MELON)—to achieve low attack success rates and robust utility.
- Extending across ML, cryptographic security, and cyber-intrusion detection, these frameworks highlight multi-domain applications and the challenge of countering adaptive adversarial strategies.
IPI-centric defense frameworks constitute a diverse class of mechanisms, models, and architectures designed to mitigate Indirect Prompt Injection (IPI) and related security risks in both machine learning and traditional cyber-physical environments. IPI refers to adversarial strategies where attackers corrupt intermediate data, signals, or tool outputs—thus hijacking execution or degrading privacy without directly exploiting the core program logic or model parameters. The class encompasses defenses for LLM agents, cryptographic devices, anomaly detection pipelines, and inference-privacy systems, unifying them by their shared focus on input, process, and intent as loci of intervention.
1. Formal Definitions, Threat Models, and Taxonomies
IPI attacks are formally defined as adversarial manipulations where the attacker controls intermediary information or signals consumed by an autonomous system or agent, causing it to deviate from its intended trajectory or to leak sensitive information. In LLM-based agent systems, the attacker injects instructions into tool return values such that the subsequent sequence of tool calls reflects a malicious rather than benign trajectory; the threat model focuses on black-box adversaries controlling static resources in the agent's environment, such as webpages or databases returned by callable tools (Ji et al., 19 Nov 2025, An et al., 21 Aug 2025).
A unifying five-dimensional taxonomy was developed to organize IPI-centric defense frameworks across domains (Ji et al., 19 Nov 2025):
| Dimension | Representative Values |
|---|---|
| Technical Paradigm | Detection, Prompt Engineering, Fine-tuning, System Design, Runtime Checking, Policy Enforcing |
| Intervention Stage | Pre-inference, Intra-inference, Post-inference |
| Model Access | White-box, Black-box |
| Explainability | Deterministic, Probabilistic |
| Automation Level | Full-automation, Semi-automation |
This taxonomy enables precise comparison and synthesis of frameworks designed for LLM agents, cyber-intrusion detection, cryptographic key generation, and inference-control in ML.
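The five dimensions above can be encoded as a small typed data structure; the sketch below is an illustrative representation (the class and field names are assumptions, not part of any cited framework), shown with a plausible profile for IPIGuard.

```python
# Illustrative encoding of the five-dimensional IPI-defense taxonomy.
# Class/field names are hypothetical; the dimension values come from the table.
from dataclasses import dataclass
from enum import Enum

class Paradigm(Enum):
    DETECTION = "Detection"
    PROMPT_ENGINEERING = "Prompt Engineering"
    FINE_TUNING = "Fine-tuning"
    SYSTEM_DESIGN = "System Design"
    RUNTIME_CHECKING = "Runtime Checking"
    POLICY_ENFORCING = "Policy Enforcing"

class Stage(Enum):
    PRE = "Pre-inference"
    INTRA = "Intra-inference"
    POST = "Post-inference"

@dataclass(frozen=True)
class DefenseProfile:
    name: str
    paradigm: Paradigm
    stage: Stage
    white_box: bool        # Model Access dimension
    deterministic: bool    # Explainability dimension
    fully_automated: bool  # Automation Level dimension

# A plausible (assumed) profile for a system-design defense such as IPIGuard.
ipiguard = DefenseProfile("IPIGuard", Paradigm.SYSTEM_DESIGN, Stage.INTRA,
                          white_box=False, deterministic=True,
                          fully_automated=True)
```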
2. Framework Architectures in LLM Agent Security
IPI-centric defenses in LLM-agent settings fall into several categories, each corresponding to axes in the above taxonomy (Ji et al., 19 Nov 2025, An et al., 21 Aug 2025, Zhu et al., 7 Feb 2025):
- Detection-based strategies use auxiliary classifiers or guard-LLMs to inspect tool outputs for signatures of prompt injection (e.g., LlamaFirewall’s alignment check).
- Prompt engineering enforces declarative tool-use constraints or injects delimiters to constrain the model's attention (e.g., "Tool Filter" explicitly whitelists callable tools).
- Fine-tuning (e.g., DPO or preference-based) involves retraining an LLM to reject known attack patterns via binary classification or reinforcement learning.
- System design approaches, exemplified by IPIGuard, structure execution as a traversal over a planned Tool Dependency Graph (TDG), enforcing structural separation and pre-computation of all legitimate tool invocations. This class includes dual-agent or segregated-context architectures.
- Runtime Checking and Policy Enforcement apply LLM-judges, DSL-based access control, or static policy checking post-decision but pre-execution.
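The whitelist idea behind "Tool Filter"-style defenses can be sketched as a simple guard; this is a minimal illustration, not the actual implementation of any cited framework (the function and tool names are hypothetical).

```python
# Minimal sketch of a tool-whitelist guard in the "Tool Filter" style.
# Tool names and the guard API are hypothetical.
ALLOWED_TOOLS = {"search_web", "read_calendar"}  # derived from the user task

def guard_tool_call(tool_name: str, allowed: set) -> None:
    """Block any tool invocation outside the pre-approved whitelist."""
    if tool_name not in allowed:
        raise PermissionError(f"Blocked non-whitelisted tool: {tool_name}")

guard_tool_call("search_web", ALLOWED_TOOLS)  # planned call: passes
try:
    guard_tool_call("send_email", ALLOWED_TOOLS)  # injected, unplanned action
except PermissionError as e:
    print(e)
```

Even this trivial check removes the coarsest hijack vector (calling an entirely unplanned tool), though it says nothing about parameter-level manipulation.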
For example, IPIGuard (An et al., 21 Aug 2025) constructs a TDG in which each node represents a tool call (with its arguments) and each edge captures an explicit data dependency. Execution strictly adheres to the planned graph traversal, and any attempt to invoke a non-planned tool is blocked.
This eliminates the primary vector for IPI-induced tool hijack by removing any opportunity for injected payloads to modify the set of allowed actions after planning.
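The structural separation can be sketched as follows; the node/edge representation and executor API below are assumptions for illustration, not IPIGuard's actual code.

```python
# Illustrative sketch of execution over a planned Tool Dependency Graph
# (TDG), in the spirit of IPIGuard. Data layout and API are assumptions.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolNode:
    tool: str
    args: tuple  # argument names are fixed at planning time

@dataclass
class TDG:
    nodes: list                                 # planned calls, traversal order
    edges: dict = field(default_factory=dict)   # data deps: node idx -> [idx]

def execute(tdg, call_tool):
    """Traverse the planned graph; only planned nodes are ever invoked.

    Tool outputs may contain injected text, but that text cannot add
    new nodes or reorder the traversal after planning.
    """
    results = {}
    for i, node in enumerate(tdg.nodes):
        deps = [results[j] for j in tdg.edges.get(i, [])]
        results[i] = call_tool(node, deps)
    return results
```

Because the executor iterates only over `tdg.nodes`, an injected payload in a tool's return value has no code path by which to introduce an unplanned invocation.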
MELON (Zhu et al., 7 Feb 2025) instead applies parallel masked re-execution: the system replays the agent on a neutralized version of the prompt/context and detects attacks by measuring semantic similarity (via embedding cosine similarity) between tool calls from original and masked runs. This detects if actions are more influenced by injected instructions than by the intended user objective.
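The masked re-execution comparison can be illustrated with a toy similarity check; MELON uses learned embeddings, whereas the bag-of-words vectors and threshold below are stand-ins chosen purely for a self-contained example.

```python
# Toy version of MELON-style masked re-execution checking. Real MELON uses
# learned embeddings; bag-of-words cosine similarity here is a stand-in.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_injected(original_calls, masked_calls, threshold=0.8):
    """If the masked run (user task neutralized) still produces highly
    similar tool calls, those calls were driven by injected instructions."""
    sims = [cosine(embed(o), embed(m))
            for o, m in zip(original_calls, masked_calls)]
    return any(s >= threshold for s in sims)
```

The intuition: a benign action should disappear when the user task is masked out, so persistent, near-identical tool calls across the two runs indicate attacker influence.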
Key performance data for these frameworks is summarized below (from AgentDojo benchmark, GPT-4o-mini, Important Instruction attack):
| Defense | ASR↓ | UA↑ | BU↑ |
|---|---|---|---|
| No Defense | 27.2% | 49.9% | 68.0% |
| Detector | 8.6% | 23.1% | 32.1% |
| Tool Filter | 4.9% | 55.2% | 64.9% |
| Spotlight | 22.3% | 53.7% | 65.5% |
| Sandwich | 9.4% | 51.0% | 60.2% |
| IPIGuard | 0.64% | 57.1% | 69.1% |
IPIGuard achieves sub-1% Attack Success Rate (ASR), maintaining near-oracle benign utility (BU) (An et al., 21 Aug 2025).
3. Broader IPI-Centric Defense Mechanisms Beyond LLMs
IPI-centric frameworks extend to multiple domains outside LLMs.
Privacy-Preserving ML: Targeted Inference Defense
Redactor (Heo et al., 2022) proposes an individualized, data-centric framework for mitigating inference attacks (e.g., membership inference) in settings where the defender can neither delete nor modify existing data, nor retrain models. The approach synthesizes targeted "disinformation" records as close as possible to the attacker's target in input space, while ensuring their label is flipped with high probability according to a conservatively estimated Probabilistic Decision Boundary (PDB):

$$x^{*} = \arg\min_{x} \, d(x, t) \quad \text{s.t.} \quad \Pr_{f \in F}\big[f(x) \neq f(t)\big] \geq \delta$$

where $t$ is the target, $F$ is the PDB classifier ensemble, and $\delta$ is a confidence threshold. This produces instance-specific, minimally invasive defenses.
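A minimal sketch of this selection step, assuming the PDB is approximated by a plain classifier ensemble and candidates are drawn from some external generator (all names below are hypothetical):

```python
# Hedged sketch of Redactor-style disinformation selection: among candidate
# points, pick the one closest to the target whose label is flipped with
# ensemble agreement at least delta. The ensemble stands in for the PDB.
import math

def flip_probability(ensemble, x, target_label):
    votes = sum(1 for clf in ensemble if clf(x) != target_label)
    return votes / len(ensemble)

def select_disinformation(candidates, target, target_label, ensemble,
                          delta=0.9):
    feasible = [x for x in candidates
                if flip_probability(ensemble, x, target_label) >= delta]
    if not feasible:
        return None  # no candidate satisfies the flip constraint
    return min(feasible, key=lambda x: math.dist(x, target))
```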
Cryptographic Security: IMD Randomness Extraction
For implantable medical devices, Chizari & Lupu (Chizari et al., 2018) define an IPI-centric cryptographic defense: rather than using raw Inter-Pulse Intervals (IPI) as a randomness source for key generation, which fails key-unpredictability tests, they introduce Martingale Randomness Extraction from IPI (MRE-IPI), which operates on trends (bit windows of successive IPI LSBs combined in a martingale structure) and yields a pseudo-random sequence with Shannon entropy approaching $0.9999$ and high min-entropy per 16-bit block.
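The core idea of extracting bits from trends rather than raw LSBs can be illustrated with a toy debiasing extractor; this is explicitly not the published MRE-IPI construction, only a simplified illustration of trend-based extraction.

```python
# Toy illustration only (NOT the published MRE-IPI algorithm): extract bits
# from the trend between successive IPI LSBs rather than from raw LSBs,
# emitting a bit only for unequal pairs to debias the output.
def trend_extract(ipis_ms):
    lsbs = [int(x) & 1 for x in ipis_ms]
    bits = []
    for a, b in zip(lsbs[::2], lsbs[1::2]):
        if a != b:                    # discard ties; keep only trends
            bits.append(1 if a < b else 0)   # rising -> 1, falling -> 0
    return bits
```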
Cyber-Intrusion: Action-Intent Mapping
The Action-Intent Framework (AIF) (Moskal et al., 2020) systematizes alerts by classifying observables into “micro” (tactics/procedures) and “macro” (intent/impact) Action-Intent States, e.g., mapping a “SSH brute force” IDS alert to “Brute-Force Credential Access” (micro) and “Privilege Escalation” (macro). This supports IPI-centric defense pipelines by structuring the interpretability and triage of security observables.
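A lookup of this kind is straightforward to sketch; the table below contains only the example mapping given in the text plus hypothetical extra entries, and the function names are assumptions.

```python
# Illustrative AIF-style lookup: map a raw IDS alert signature to its micro
# (tactic) and macro (intent) Action-Intent States. Only the ssh_brute_force
# row comes from the text; the other entries are hypothetical.
AIF_MAP = {
    "ssh_brute_force": ("Brute-Force Credential Access",
                        "Privilege Escalation"),
    "port_scan":       ("Network Discovery", "Reconnaissance"),
    "dns_exfil":       ("Exfiltration Over DNS", "Data Exfiltration"),
}

def classify_alert(signature):
    micro, macro = AIF_MAP.get(signature, ("Unknown Tactic", "Unknown Intent"))
    return {"micro": micro, "macro": macro}
```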
Mimic Defense: Heterogeneity and Scheduling
SIDMD (Fu et al., 2022) demonstrates an IPI-centric mimic defense using input-aware clustering of vulnerabilities and dynamic, randomized scheduling across functionally redundant but heterogeneous program variants, empirically reducing attack success probability substantially compared to static redundancy.
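The scheduling step can be sketched as follows; the cluster labels and variant representation are assumed inputs, and this is a simplified illustration rather than SIDMD's actual scheduler.

```python
# Sketch of mimic-defense scheduling: route a request at random to a variant
# whose known vulnerability clusters do not match the input's cluster.
# Variant/cluster representations are assumptions for illustration.
import random

def schedule(variants, input_cluster, rng=None):
    """variants: list of (name, vulnerable_clusters) pairs."""
    rng = rng or random.Random(0)
    safe = [name for name, vuln in variants if input_cluster not in vuln]
    pool = safe or [name for name, _ in variants]  # degrade gracefully
    return rng.choice(pool)
```

Randomizing among the safe variants (rather than always picking the same one) is what makes the scheduling hard for an adaptive attacker to predict.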
Human Factors: Intimate Partner Infiltration
AID (Yang et al., 6 Feb 2025) extends IPI-centric defense into the human space, using on-device, multimodal sensing and autoencoder-based user modeling to continuously monitor and classify user actions, detecting unauthorized device access and intimate partner tampering with F1 scores up to $0.981$ and a low false-positive rate.
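The detection principle (flag behavior whose reconstruction error under the owner's learned model is anomalously high) can be shown with a deliberately simple stand-in; AID trains an autoencoder, whereas the "model" below is just the mean of benign feature vectors.

```python
# Stand-in for autoencoder-based user modeling: treat distance from the mean
# of the owner's benign feature vectors as a proxy for reconstruction error,
# and flag rows whose error exceeds a threshold fit on benign data.
def fit_profile(train_rows):
    dim = len(train_rows[0])
    mean = [sum(r[i] for r in train_rows) / len(train_rows)
            for i in range(dim)]
    errs = [sum((r[i] - mean[i]) ** 2 for i in range(dim)) ** 0.5
            for r in train_rows]
    threshold = max(errs) * 1.5   # margin over worst benign error (assumed)
    return mean, threshold

def is_tampering(mean, threshold, row):
    err = sum((row[i] - mean[i]) ** 2 for i in range(len(row))) ** 0.5
    return err > threshold
```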
4. Root Cause Analysis, Attack Adaptation, and Limitations
A comprehensive SoK (Ji et al., 19 Nov 2025) identifies six recurring root causes of circumvention in LLM IPI defenses:
- Imprecise access control over tool selection
- Imprecise access control over tool parameters
- Incomplete isolation of malicious information
- Judgment errors in checking/detecting LLMs
- Inadequate coverage of security policies
- Poor generalization ability against unforeseen payloads
Adaptive attacks, such as semantic masquerading (payload rewriting), cascading (conditional branching between judge and executor LLMs), and isolation breach (leaking malicious context through error or side-channel flows), raise ASRs by up to 4.8× against some frameworks.
Frameworks like IPIGuard and MELON report rare residual failures stemming from overlapping tool invocations, input hallucination, or parameter-level manipulation that falls outside the scope of the graph or planner (An et al., 21 Aug 2025, Zhu et al., 7 Feb 2025).
5. System Architecture, Evaluation Practices, and Design Guidelines
A consensus emerges that robust IPI-centric defense frameworks require architectural separation between planning and execution, fine-grained policy enforcement (including argument-level checks), and explicit modeling/isolation of dynamically injected or adversary-controlled data.
Best practices, distilled from empirical results and failure analyses (Ji et al., 19 Nov 2025, An et al., 21 Aug 2025):
- Enumerate and pre-approve all tool dependencies before execution; use explicit DAGs for agent workflows.
- Isolate planning and execution phases to prevent propagation of attacker payload post-planning.
- Fuse structural (programmatic) with probabilistic (ML/LLM-based) checks for redundancy.
- Monitor latent vulnerabilities and adapt scheduling/instance selection dynamically (SIDMD model).
- Quantitatively benchmark utility, ASR, and latency/token overheads using public benchmark suites (e.g., AgentDojo).
The following table concisely collates common intervention paradigms and representative frameworks:
| Paradigm | Example Frameworks | Distinguishing Feature |
|---|---|---|
| System Design | IPIGuard (An et al., 21 Aug 2025), CaMeL | Graph- or IFC-based structural isolation |
| Detection | MELON (Zhu et al., 7 Feb 2025), LlamaFW | Masked re-execution, classifier guards |
| Policy Enforcing | Progent | DSL/firewall for declarative constraints |
| Mimic Defense | SIDMD (Fu et al., 2022) | Spatiotemporal redundancy, input-aware |
| Data-centric | Redactor (Heo et al., 2022) | Targeted, individualized data poisoning |
6. Extensions, Open Challenges, and Future Directions
Current IPI-centric defense frameworks emphasize structural and pipeline-level rigor; nonetheless, open questions remain regarding scalability to text-only, non-tool-based payloads (An et al., 21 Aug 2025), formal non-interference proofs for agent architectures, and automated generation/validation of wide-scope security policies (Ji et al., 19 Nov 2025).
Further research aims to combine hybrid defenses—incorporating adversarial training with structural payloads, formal taint tracking at parameter granularity, and auto-policy synthesis with human-in-the-loop validation—to address the remaining gaps in robustness and utility tradeoffs.
Empirical and theoretical progress in this area is anticipated to inform the next generation of secure, usable, and broadly generalizable IPI-centric defense frameworks, relevant both for AI-agent deployment and for critical cyber-physical infrastructure resilience.