IPI-Centric Defense Frameworks
- IPI-centric defense frameworks are mechanisms designed to mitigate indirect prompt injection by securing the input, process, and intent phases of system operation.
- They employ diverse techniques—including detection, prompt engineering, fine-tuning, and structured system design (e.g., IPIGuard, MELON)—to achieve low attack success rates and robust utility.
- Extending across ML, cryptographic security, and cyber-intrusion detection, these frameworks highlight multi-domain applications and the challenge of countering adaptive adversarial strategies.
IPI-centric defense frameworks constitute a diverse class of mechanisms, models, and architectures designed to mitigate Indirect Prompt Injection (IPI) and related security risks in both machine learning and traditional cyber-physical environments. IPI refers to adversarial strategies where attackers corrupt intermediate data, signals, or tool outputs—thus hijacking execution or degrading privacy without directly exploiting the core program logic or model parameters. The class encompasses defenses for LLM agents, cryptographic devices, anomaly detection pipelines, and inference-privacy systems, unifying them by their shared focus on input, process, and intent as loci of intervention.
1. Formal Definitions, Threat Models, and Taxonomies
IPI attacks are formally defined as adversarial manipulations where the attacker controls intermediary information or signals consumed by an autonomous system or agent, causing it to deviate from its intended trajectory or to leak sensitive information. In LLM-based agent systems, the attacker injects instructions into tool return values such that the subsequent sequence of tool calls reflects a malicious rather than benign trajectory; the threat model focuses on black-box adversaries controlling static resources in the agent's environment, such as webpages or databases returned by callable tools (Ji et al., 19 Nov 2025, An et al., 21 Aug 2025).
A unifying five-dimensional taxonomy was developed to organize IPI-centric defense frameworks across domains (Ji et al., 19 Nov 2025):
| Dimension | Representative Values |
|---|---|
| Technical Paradigm | Detection, Prompt Engineering, Fine-tuning, System Design, Runtime Checking, Policy Enforcing |
| Intervention Stage | Pre-inference, Intra-inference, Post-inference |
| Model Access | White-box, Black-box |
| Explainability | Deterministic, Probabilistic |
| Automation Level | Full-automation, Semi-automation |
This taxonomy enables precise comparison and synthesis of frameworks designed for LLM agents, cyber-intrusion detection, cryptographic key generation, and inference-control in ML.
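The five dimensions above can be encoded as a small typed data structure; the sketch below is an illustrative representation (the class and field names are assumptions, not part of any cited framework), shown with a plausible profile for IPIGuard.

```python
# Illustrative encoding of the five-dimensional IPI-defense taxonomy.
# Class/field names are hypothetical; the dimension values come from the table.
from dataclasses import dataclass
from enum import Enum

class Paradigm(Enum):
    DETECTION = "Detection"
    PROMPT_ENGINEERING = "Prompt Engineering"
    FINE_TUNING = "Fine-tuning"
    SYSTEM_DESIGN = "System Design"
    RUNTIME_CHECKING = "Runtime Checking"
    POLICY_ENFORCING = "Policy Enforcing"

class Stage(Enum):
    PRE = "Pre-inference"
    INTRA = "Intra-inference"
    POST = "Post-inference"

@dataclass(frozen=True)
class DefenseProfile:
    name: str
    paradigm: Paradigm
    stage: Stage
    white_box: bool        # Model Access dimension
    deterministic: bool    # Explainability dimension
    fully_automated: bool  # Automation Level dimension

# A plausible (assumed) profile for a system-design defense such as IPIGuard.
ipiguard = DefenseProfile("IPIGuard", Paradigm.SYSTEM_DESIGN, Stage.INTRA,
                          white_box=False, deterministic=True,
                          fully_automated=True)
```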
2. Framework Architectures in LLM Agent Security
IPI-centric defenses in LLM-agent settings fall into several categories, each corresponding to axes in the above taxonomy (Ji et al., 19 Nov 2025, An et al., 21 Aug 2025, Zhu et al., 7 Feb 2025):
- Detection-based strategies use auxiliary classifiers or guard-LLMs to inspect tool outputs for signatures of prompt injection (e.g., LlamaFirewall’s alignment check).
- Prompt engineering enforces declarative tool-use constraints or injects delimiters to constrain the model's attention (e.g., "Tool Filter" explicitly whitelists callable tools).
- Fine-tuning (e.g., DPO or preference-based) involves retraining an LLM to reject known attack patterns via binary classification or reinforcement learning.
- System design approaches, exemplified by IPIGuard, structure execution as a traversal over a planned Tool Dependency Graph (TDG), enforcing structural separation and pre-computation of all legitimate tool invocations. This class includes dual-agent or segregated-context architectures.
- Runtime Checking and Policy Enforcement apply LLM-judges, DSL-based access control, or static policy checking post-decision but pre-execution.
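The whitelist idea behind "Tool Filter"-style defenses can be sketched as a simple guard; this is a minimal illustration, not the actual implementation of any cited framework (the function and tool names are hypothetical).

```python
# Minimal sketch of a tool-whitelist guard in the "Tool Filter" style.
# Tool names and the guard API are hypothetical.
ALLOWED_TOOLS = {"search_web", "read_calendar"}  # derived from the user task

def guard_tool_call(tool_name: str, allowed: set) -> None:
    """Block any tool invocation outside the pre-approved whitelist."""
    if tool_name not in allowed:
        raise PermissionError(f"Blocked non-whitelisted tool: {tool_name}")

guard_tool_call("search_web", ALLOWED_TOOLS)  # planned call: passes
try:
    guard_tool_call("send_email", ALLOWED_TOOLS)  # injected, unplanned action
except PermissionError as e:
    print(e)
```

Even this trivial check removes the coarsest hijack vector (calling an entirely unplanned tool), though it says nothing about parameter-level manipulation.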
For example, IPIGuard (An et al., 21 Aug 2025) constructs a TDG in which each node represents a tool call (with its arguments) and each edge captures an explicit data dependency. Execution strictly adheres to the planned graph traversal, and any attempt to invoke a non-planned tool is blocked.
This eliminates the primary vector for IPI-induced tool hijack by removing any opportunity for injected payloads to modify the set of allowed actions after planning.
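The structural separation can be sketched as follows; the node/edge representation and executor API below are assumptions for illustration, not IPIGuard's actual code.

```python
# Illustrative sketch of execution over a planned Tool Dependency Graph
# (TDG), in the spirit of IPIGuard. Data layout and API are assumptions.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolNode:
    tool: str
    args: tuple  # argument names are fixed at planning time

@dataclass
class TDG:
    nodes: list                                 # planned calls, traversal order
    edges: dict = field(default_factory=dict)   # data deps: node idx -> [idx]

def execute(tdg, call_tool):
    """Traverse the planned graph; only planned nodes are ever invoked.

    Tool outputs may contain injected text, but that text cannot add
    new nodes or reorder the traversal after planning.
    """
    results = {}
    for i, node in enumerate(tdg.nodes):
        deps = [results[j] for j in tdg.edges.get(i, [])]
        results[i] = call_tool(node, deps)
    return results
```

Because the executor iterates only over `tdg.nodes`, an injected payload in a tool's return value has no code path by which to introduce an unplanned invocation.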
MELON (Zhu et al., 7 Feb 2025) instead applies parallel masked re-execution: the system replays the agent on a neutralized version of the prompt/context and detects attacks by measuring semantic similarity (via embedding cosine similarity) between tool calls from original and masked runs. This detects if actions are more influenced by injected instructions than by the intended user objective.
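The masked re-execution comparison can be illustrated with a toy similarity check; MELON uses learned embeddings, whereas the bag-of-words vectors and threshold below are stand-ins chosen purely for a self-contained example.

```python
# Toy version of MELON-style masked re-execution checking. Real MELON uses
# learned embeddings; bag-of-words cosine similarity here is a stand-in.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_injected(original_calls, masked_calls, threshold=0.8):
    """If the masked run (user task neutralized) still produces highly
    similar tool calls, those calls were driven by injected instructions."""
    sims = [cosine(embed(o), embed(m))
            for o, m in zip(original_calls, masked_calls)]
    return any(s >= threshold for s in sims)
```

The intuition: a benign action should disappear when the user task is masked out, so persistent, near-identical tool calls across the two runs indicate attacker influence.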
Key performance data for these frameworks is summarized below (from AgentDojo benchmark, GPT-4o-mini, Important Instruction attack):
| Defense | ASR↓ | UA↑ | BU↑ |
|---|---|---|---|
| No Defense | 27.2% | 49.9% | 68.0% |
| Detector | 8.6% | 23.1% | 32.1% |
| Tool Filter | 4.9% | 55.2% | 64.9% |
| Spotlight | 22.3% | 53.7% | 65.5% |
| Sandwich | 9.4% | 51.0% | 60.2% |
| IPIGuard | 0.64% | 57.1% | 69.1% |
IPIGuard achieves sub-1% Attack Success Rate (ASR), maintaining near-oracle benign utility (BU) (An et al., 21 Aug 2025).
3. Broader IPI-Centric Defense Mechanisms Beyond LLMs
IPI-centric frameworks extend to multiple domains outside LLMs.
Privacy-Preserving ML: Targeted Inference Defense
Redactor (Heo et al., 2022) proposes an individualized, data-centric framework for mitigating inference attacks (e.g., membership inference) in settings where the defender can neither delete nor modify existing data, nor retrain models. The approach synthesizes targeted "disinformation" records as close as possible to the attacker's target in input space, while ensuring their label is flipped with high probability according to a conservatively estimated Probabilistic Decision Boundary (PDB):

$$x^{*} = \arg\min_{x} \, d(x, t) \quad \text{s.t.} \quad \Pr_{f \in F}\big[f(x) \neq f(t)\big] \geq \delta$$

where $t$ is the target, $F$ is the PDB classifier ensemble, and $\delta$ is a confidence threshold. This produces instance-specific, minimally invasive defenses.
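A minimal sketch of this selection step, assuming the PDB is approximated by a plain classifier ensemble and candidates are drawn from some external generator (all names below are hypothetical):

```python
# Hedged sketch of Redactor-style disinformation selection: among candidate
# points, pick the one closest to the target whose label is flipped with
# ensemble agreement at least delta. The ensemble stands in for the PDB.
import math

def flip_probability(ensemble, x, target_label):
    votes = sum(1 for clf in ensemble if clf(x) != target_label)
    return votes / len(ensemble)

def select_disinformation(candidates, target, target_label, ensemble,
                          delta=0.9):
    feasible = [x for x in candidates
                if flip_probability(ensemble, x, target_label) >= delta]
    if not feasible:
        return None  # no candidate satisfies the flip constraint
    return min(feasible, key=lambda x: math.dist(x, target))
```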
Cryptographic Security: IMD Randomness Extraction
For implantable medical devices, Chizari & Lupu (Chizari et al., 2018) define an IPI-centric cryptographic defense: rather than using raw Inter-Pulse Intervals (IPI) as a randomness source for key generation, which fails key-unpredictability tests, they introduce Martingale Randomness Extraction from IPI (MRE-IPI), which operates on trends (bit windows of successive IPI LSBs combined in a martingale structure) and yields a pseudo-random sequence with Shannon entropy approaching $0.9999$ and high min-entropy per 16-bit block.
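The core idea of extracting bits from trends rather than raw LSBs can be illustrated with a toy debiasing extractor; this is explicitly not the published MRE-IPI construction, only a simplified illustration of trend-based extraction.

```python
# Toy illustration only (NOT the published MRE-IPI algorithm): extract bits
# from the trend between successive IPI LSBs rather than from raw LSBs,
# emitting a bit only for unequal pairs to debias the output.
def trend_extract(ipis_ms):
    lsbs = [int(x) & 1 for x in ipis_ms]
    bits = []
    for a, b in zip(lsbs[::2], lsbs[1::2]):
        if a != b:                    # discard ties; keep only trends
            bits.append(1 if a < b else 0)   # rising -> 1, falling -> 0
    return bits
```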
Cyber-Intrusion: Action-Intent Mapping
The Action-Intent Framework (AIF) (Moskal et al., 2020) systematizes alerts by classifying observables into “micro” (tactics/procedures) and “macro” (intent/impact) Action-Intent States, e.g., mapping a “SSH brute force” IDS alert to “Brute-Force Credential Access” (micro) and “Privilege Escalation” (macro). This supports IPI-centric defense pipelines by structuring the interpretability and triage of security observables.
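A lookup of this kind is straightforward to sketch; the table below contains only the example mapping given in the text plus hypothetical extra entries, and the function names are assumptions.

```python
# Illustrative AIF-style lookup: map a raw IDS alert signature to its micro
# (tactic) and macro (intent) Action-Intent States. Only the ssh_brute_force
# row comes from the text; the other entries are hypothetical.
AIF_MAP = {
    "ssh_brute_force": ("Brute-Force Credential Access",
                        "Privilege Escalation"),
    "port_scan":       ("Network Discovery", "Reconnaissance"),
    "dns_exfil":       ("Exfiltration Over DNS", "Data Exfiltration"),
}

def classify_alert(signature):
    micro, macro = AIF_MAP.get(signature, ("Unknown Tactic", "Unknown Intent"))
    return {"micro": micro, "macro": macro}
```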
Mimic Defense: Heterogeneity and Scheduling
SIDMD (Fu et al., 2022) demonstrates an IPI-centric mimic defense using input-aware clustering of vulnerabilities and dynamic, randomized scheduling across functionally redundant but heterogeneous program variants, empirically reducing attack success probability substantially compared to static redundancy.
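The scheduling step can be sketched as follows; the cluster labels and variant representation are assumed inputs, and this is a simplified illustration rather than SIDMD's actual scheduler.

```python
# Sketch of mimic-defense scheduling: route a request at random to a variant
# whose known vulnerability clusters do not match the input's cluster.
# Variant/cluster representations are assumptions for illustration.
import random

def schedule(variants, input_cluster, rng=None):
    """variants: list of (name, vulnerable_clusters) pairs."""
    rng = rng or random.Random(0)
    safe = [name for name, vuln in variants if input_cluster not in vuln]
    pool = safe or [name for name, _ in variants]  # degrade gracefully
    return rng.choice(pool)
```

Randomizing among the safe variants (rather than always picking the same one) is what makes the scheduling hard for an adaptive attacker to predict.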
Human Factors: Intimate Partner Infiltration
AID (Yang et al., 6 Feb 2025) extends IPI-centric defense into the human space, using on-device, multimodal sensing and autoencoder-based user modeling to continuously monitor and classify user actions, detecting unauthorized device access and intimate partner tampering with F1 scores up to $0.981$ and a low false-positive rate.
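The detection principle (flag behavior whose reconstruction error under the owner's learned model is anomalously high) can be shown with a deliberately simple stand-in; AID trains an autoencoder, whereas the "model" below is just the mean of benign feature vectors.

```python
# Stand-in for autoencoder-based user modeling: treat distance from the mean
# of the owner's benign feature vectors as a proxy for reconstruction error,
# and flag rows whose error exceeds a threshold fit on benign data.
def fit_profile(train_rows):
    dim = len(train_rows[0])
    mean = [sum(r[i] for r in train_rows) / len(train_rows)
            for i in range(dim)]
    errs = [sum((r[i] - mean[i]) ** 2 for i in range(dim)) ** 0.5
            for r in train_rows]
    threshold = max(errs) * 1.5   # margin over worst benign error (assumed)
    return mean, threshold

def is_tampering(mean, threshold, row):
    err = sum((row[i] - mean[i]) ** 2 for i in range(len(row))) ** 0.5
    return err > threshold
```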
4. Root Cause Analysis, Attack Adaptation, and Limitations
A comprehensive SoK (Ji et al., 19 Nov 2025) identifies six recurring root causes of circumvention in LLM IPI defenses:
- Imprecise access control over tool selection
- Imprecise access control over tool parameters
- Incomplete isolation of malicious information
- Judgment errors in checking/detecting LLMs
- Inadequate coverage of security policies
- Poor generalization ability against unforeseen payloads
Adaptive attacks, such as semantic masquerading (payload rewriting), cascading (conditional branching between judge and executor LLMs), and isolation breach (leaking malicious context through error or side-channel flows), raise ASRs by up to 4.8× against some frameworks.
Frameworks like IPIGuard and MELON report rare residual failures stemming from overlapping tool invocations, input hallucination, or parameter-level manipulation that falls outside the scope of the graph or planner (An et al., 21 Aug 2025, Zhu et al., 7 Feb 2025).
5. System Architecture, Evaluation Practices, and Design Guidelines
A consensus emerges that robust IPI-centric defense frameworks require architectural separation between planning and execution, fine-grained policy enforcement (including argument-level checks), and explicit modeling/isolation of dynamically injected or adversary-controlled data.
Best practices, distilled from empirical results and failure analyses (Ji et al., 19 Nov 2025, An et al., 21 Aug 2025):
- Enumerate and pre-approve all tool dependencies before execution; use explicit DAGs for agent workflows.
- Isolate planning and execution phases to prevent propagation of attacker payload post-planning.
- Fuse structural (programmatic) with probabilistic (ML/LLM-based) checks for redundancy.
- Monitor latent vulnerabilities and adapt scheduling/instance selection dynamically (SIDMD model).
- Quantitatively benchmark utility, ASR, and latency/token overheads using public benchmark suites (e.g., AgentDojo).
The following table concisely collates common intervention paradigms and representative frameworks:
| Paradigm | Example Frameworks | Distinguishing Feature |
|---|---|---|
| System Design | IPIGuard (An et al., 21 Aug 2025), CaMeL | Graph- or IFC-based structural isolation |
| Detection | MELON (Zhu et al., 7 Feb 2025), LlamaFW | Masked re-execution, classifier guards |
| Policy Enforcing | Progent | DSL/firewall for declarative constraints |
| Mimic Defense | SIDMD (Fu et al., 2022) | Spatiotemporal redundancy, input-aware |
| Data-centric | Redactor (Heo et al., 2022) | Targeted, individualized data poisoning |
6. Extensions, Open Challenges, and Future Directions
Current IPI-centric defense frameworks emphasize structural and pipeline-level rigor; nonetheless, open questions remain regarding scalability to text-only, non-tool-based payloads (An et al., 21 Aug 2025), formal non-interference proofs for agent architectures, and automated generation/validation of wide-scope security policies (Ji et al., 19 Nov 2025).
Further research aims to combine hybrid defenses—incorporating adversarial training with structural payloads, formal taint tracking at parameter granularity, and auto-policy synthesis with human-in-the-loop validation—to address the remaining gaps in robustness and utility tradeoffs.
Empirical and theoretical progress in this area is anticipated to inform the next generation of secure, usable, and broadly generalizable IPI-centric defense frameworks, relevant both for AI-agent deployment and for critical cyber-physical infrastructure resilience.