IntentGuard: Intent-Aligned AI Security

Updated 7 December 2025
  • IntentGuard is a framework for securing AI agents by enforcing user-intent-driven, fine-grained runtime permissions.
  • It employs task interpreters and policy engines to dynamically bind and revoke privileges based on explicit user goals.
  • By integrating LLM-based intent analysis and cryptographic checks, IntentGuard effectively mitigates prompt injection and over-privilege vulnerabilities.

IntentGuard refers to a class of access control and intent analysis frameworks designed to rigorously enforce intent-aligned, fine-grained runtime security over autonomous AI agents, particularly LLM-driven assistants and multi-modal systems. Distinct from legacy permissioning or simple prompt filtering, IntentGuard solutions sculpt the precise set of privileges or permissible model behaviors around the user’s high-level goal—denying all agent actions not demonstrably in service of that goal. This approach neutralizes instruction- and prompt-injection attacks and mitigates over-privilege, setting a new standard for safeguarding AI agents operating in open or adversarial environments (Cai et al., 30 Oct 2025, Kang et al., 30 Nov 2025, Li et al., 30 Oct 2024).

1. Threat Models and Motivation

IntentGuard systems are motivated by several intersecting threat models:

  • Instruction injection: Malicious actors embed harmful natural language commands in otherwise benign content (e.g., email, app UIs), aiming to hijack over-privileged agents into unauthorized actions such as exfiltrating emails or performing dangerous API calls (Cai et al., 30 Oct 2025).
  • Indirect prompt injection attacks (IPIAs): LLM-powered agents receive attacker-crafted instructions via untrusted data sources. The decisive factor for exploitation is whether the LLM intends to follow the injected, untrusted instruction rather than its mere presence (Kang et al., 30 Nov 2025).
  • Prompt/intent ambiguity and over-defense: Guard models may misclassify harmless prompts containing “trigger” words as malicious, leading to high false positive rates and diminished utility (Li et al., 30 Oct 2024).
  • Multimodal implicit attacks: In joint-modal settings, adversaries can encode unsafe instructions across multiple modalities (e.g., text and image), such that only their combination triggers malicious behavior (Zhang et al., 20 Oct 2025).

Legacy permission models—such as static, app-level OS grants—cannot distinguish user-intended actions from injected or unintended agent behaviors, creating persistent vulnerabilities.

2. Task-Centric Access Control: Mechanisms and Formalization

IntentGuard implements a task-centric, just-in-time access control paradigm that binds runtime permissions to the user’s explicit task intent, rather than to the agent’s identity or broad session-level privileges (Cai et al., 30 Oct 2025). The defining architectural components include:

  • Task Interpreter: Parses high-level natural language requests into a structured TaskContext specifying the goal and permissible resources (e.g., SignUp(AppX), allowed email domain).
  • Policy Generation Engine (PGE): Associates TaskContexts with parameterized policy templates, yielding a PolicySet of minimal, temporary permissions.
  • Policy Enforcement Point (PEP) and Policy Decision Point (PDP): Mediate all agent actions, mapping each low-level request (AgentID, Resource, Operation, Context) against the active PolicySet via a default-deny rule.

A formal policy rule is a tuple:

\pi = (\text{agent}, \text{resource}, \text{operation}, \text{context}) \rightarrow \{\text{Allow}, \text{Deny}\}

At action time, the PDP executes:

\text{Decision}(a) = \begin{cases} \text{Permit} & \text{if } \exists\,\pi \in \text{ActivePolicySet} \text{ matching } a \wedge \pi \rightarrow \text{Allow} \\ \text{Deny} & \text{otherwise} \end{cases}

Permissions are revoked upon task completion, sharply limiting the window of exposure and the scope of agent actions. Empirical results demonstrate 100% block rates on synthetic exfiltration attempts while preserving full completion of authorized tasks, with <10 ms decision latency per intercepted agent action in Android environments (Cai et al., 30 Oct 2025).
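
The decision rule above amounts to a default-deny lookup over a task-scoped policy set. The following is a minimal illustrative sketch in Python; class and field names are hypothetical, since the paper does not publish a reference implementation:

```python
from dataclasses import dataclass, field
from typing import Literal

Effect = Literal["Allow", "Deny"]       # right-hand side of a policy rule pi
Decision = Literal["Permit", "Deny"]    # outcome of Decision(a)

@dataclass(frozen=True)
class PolicyRule:
    """One rule pi: (agent, resource, operation, context) -> {Allow, Deny}."""
    agent: str
    resource: str      # e.g. "email:appx.com", or "*" as a wildcard
    operation: str     # e.g. "read", "send", "api_call"
    context: str       # task identifier produced by the Task Interpreter
    effect: Effect

@dataclass
class ActivePolicySet:
    """Minimal, temporary permissions bound to a single TaskContext."""
    task_id: str
    rules: list[PolicyRule] = field(default_factory=list)

    def _matches(self, rule: PolicyRule, agent: str, resource: str,
                 operation: str, context: str) -> bool:
        return (rule.agent == agent
                and rule.resource in (resource, "*")
                and rule.operation == operation
                and rule.context == context)

    def decide(self, agent: str, resource: str, operation: str, context: str) -> Decision:
        # Default-deny: permit only if an explicit Allow rule matches the request.
        for rule in self.rules:
            if self._matches(rule, agent, resource, operation, context) and rule.effect == "Allow":
                return "Permit"
        return "Deny"

    def revoke(self) -> None:
        # Called on task completion: the agent immediately loses all privileges.
        self.rules.clear()

# Example: a SignUp(AppX) task grants only read access to the app's verification email.
policies = ActivePolicySet(task_id="SignUp(AppX)")
policies.rules.append(PolicyRule("assistant", "email:appx.com", "read", "SignUp(AppX)", "Allow"))

assert policies.decide("assistant", "email:appx.com", "read", "SignUp(AppX)") == "Permit"
assert policies.decide("assistant", "email:bank.com", "send", "SignUp(AppX)") == "Deny"
policies.revoke()  # task finished: every subsequent request falls back to Deny
```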

3. Intent Analysis and LLM-based Guarding

For threats involving indirect prompt injection or semantically nuanced malicious intent, runtime permission interception alone is insufficient. Contemporary IntentGuard systems therefore integrate intent-reasoning LLMs, casting the defense as a combination of:

  • Chain-of-thought (CoT) intent extraction: The LLM enumerates all latent instructions it intends to execute, guided by explicit “thinking” interventions and in-context demonstrations (Kang et al., 30 Nov 2025).
  • Origin tracing: For each extracted intent, a matching algorithm computes whether its semantic origin lies in a trusted or untrusted segment, using windowed embedding similarity with a default threshold (e.g., sim ≥ 0.7).
  • Conditional mitigation: If an extracted intent traces to untrusted input, the system either alerts the user or suppresses agent execution for that instruction.

Notably, multiple “thinking” interventions (start-of-thinking lists, end-of-thinking refinement, adversarial demonstration) maximize the faithfulness and robustness of intent extraction. Under these strategies, attack success rates (ASR) shrink to 0.05–0.11 on challenging benchmarks like AgentDojo and Mind2Web, compared to baselines up to 1.0 (Kang et al., 30 Nov 2025).
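
The origin-tracing step can be approximated as follows, assuming a generic sentence-embedding function and the cosine-similarity threshold of 0.7 mentioned above; the function names and the fallback behavior are illustrative, not taken from the cited paper:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def trace_origin(intent_vec, trusted_windows, untrusted_windows, threshold=0.7):
    """Attribute one extracted intent to the trusted or untrusted input segment.

    intent_vec:        embedding of an instruction the LLM plans to execute
    trusted_windows:   embeddings of sliding windows over the user's own prompt
    untrusted_windows: embeddings of sliding windows over retrieved/external data
    """
    best_trusted = max((cosine(intent_vec, w) for w in trusted_windows), default=0.0)
    best_untrusted = max((cosine(intent_vec, w) for w in untrusted_windows), default=0.0)

    if best_untrusted >= threshold and best_untrusted > best_trusted:
        return "untrusted"       # likely an injected instruction
    if best_trusted >= threshold:
        return "trusted"
    return "unknown"             # low-confidence case: defer to the user

def mitigate(intents, embed, trusted_text_windows, untrusted_text_windows):
    """Suppress (or flag) every intent that traces back to untrusted content."""
    trusted = [embed(t) for t in trusted_text_windows]
    untrusted = [embed(t) for t in untrusted_text_windows]
    allowed = []
    for intent in intents:
        if trace_origin(embed(intent), trusted, untrusted) == "untrusted":
            print(f"[IntentGuard] blocked injected instruction: {intent!r}")
        else:
            allowed.append(intent)
    return allowed
```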

4. Guard Models: Over-Defense, Debiasing, and Query Refinement

Prompt guard models deployed as intent analysis layers are subjected to distinct challenges:

  • Over-defense: Binary or keyword-sensitive detectors over-reject benign queries containing “risky” words, with state-of-the-art guardrails achieving below 60% benign accuracy on the NotInject benchmark (Li et al., 30 Oct 2024).
  • Mitigating Over-defense for Free (MOF): Post-training, models are stress-tested for token-level false-positive bias, and benign samples containing the identified “trigger tokens” are synthesized and added to the training set. Retraining on these MOF samples improves both over-defense (benign) accuracy and overall detection performance (e.g., InjecGuard: 87.3% over-defense accuracy vs. 56.6% for baselines) (Li et al., 30 Oct 2024). A simplified sketch of this loop follows the list.
  • Multi-level classification and rewriting: Recent intent-guarding architectures (e.g., IntentionReasoner) combine chain-of-thought intent analysis, fine-grained four-level safety labels (CU/BU/BH/CH), and query rewriting to excise borderline harmful content while preserving benign user queries (Shen et al., 27 Aug 2025). Models trained with supervision and RL achieve F1 scores up to 99.4, ASR below 1.2%, and near-zero over-refusal rates.
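
A schematic of the MOF debiasing loop referenced above; `guard.predict_proba`, the probe template, and the 0.5 threshold are assumptions for illustration, not the paper's exact procedure:

```python
def find_trigger_tokens(guard, vocab, template="please summarize the {token} report", threshold=0.5):
    """Probe the guard with benign sentences to find tokens that alone trip a detection.

    `guard.predict_proba(text)` is a hypothetical interface returning the
    probability that `text` is a prompt-injection attempt.
    """
    triggers = []
    for token in vocab:
        probe = template.format(token=token)          # benign sentence containing the token
        if guard.predict_proba(probe) >= threshold:   # flagged anyway -> token-level false positive
            triggers.append(token)
    return triggers

def build_mof_samples(triggers, benign_templates):
    """Synthesize benign (label 0) training samples that contain the over-defended tokens."""
    return [(tpl.format(token=tok), 0) for tok in triggers for tpl in benign_templates]

# Retraining on the original data plus these samples is what raises benign accuracy
# on NotInject-style prompts without sacrificing detection of real injections.
```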

5. Extending to Multimodal and Encrypted Contexts

The evolution of multimodal models and integration into complex application stacks requires further adaptation:

  • Multimodal (Joint-Modal) Implicit Attacks: Systems such as CrossGuard, trained with adversarially constructed implicit samples (via ImpForge), demonstrate that a cross-modal intent classifier (e.g., modified LLaVA-1.5-7B) can achieve attack success rates (ASR) on implicit benchmarks as low as 5.39%, compared to >48% for general-purpose models (Zhang et al., 20 Oct 2025). CrossGuard achieves utility pass rates above 90% on benign VQA data, establishing a strong security-utility tradeoff through integrated binary cross-entropy heads on joint embeddings.
  • Cryptographically-enforced permissions: Complementing runtime intent checks, Encrypted Prompt frameworks append cryptographically signed permission payloads to each user prompt. Only when model-suggested actions are explicitly authorized by a verified, signed permission block are they executed, eliminating the risk of prompt-injected or misaligned actions at the application boundary (Chan, 29 Mar 2025). This separation between "what the LLM is allowed to do" and "what the LLM chooses to do" is robust to direct/indirect prompt injection and tampering.
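
A minimal sketch of the signed-permission check at the application boundary. An HMAC is used here as a stand-in for the signature scheme, and the payload format and key handling are assumptions rather than details from the cited framework:

```python
import hmac, hashlib, json

SECRET_KEY = b"application-held key, never exposed to the LLM"

def sign_permissions(allowed_actions: list[str]) -> dict:
    """Application side: build the signed permission block appended to each prompt."""
    payload = json.dumps(sorted(allowed_actions)).encode()
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return {"actions": sorted(allowed_actions), "sig": tag}

def is_authorized(action: str, permission_block: dict) -> bool:
    """Gate every model-suggested action against the verified permission block."""
    payload = json.dumps(sorted(permission_block["actions"])).encode()
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, permission_block["sig"]):
        return False                       # tampered or forged block: deny everything
    return action in permission_block["actions"]

perms = sign_permissions(["calendar.read"])
assert is_authorized("calendar.read", perms)
assert not is_authorized("email.send", perms)   # injected action is never executed
```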

6. Evaluation, Limitations, and Future Directions

Evaluation methodologies for IntentGuard encompass attack success rate (ASR), false positive/over-refusal rate (FPR/ORR), utility preservation, and per-decision latency. Task-centric frameworks block all out-of-scope attempts in demonstrator studies, while LLM-based intent analysis sharply reduces ASR with no degradation to authorized workflows (Cai et al., 30 Oct 2025, Kang et al., 30 Nov 2025).
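
In their simplest form, these headline metrics reduce to ratios over benchmark episodes (a sketch; individual benchmarks define the denominators differently):

```python
def attack_success_rate(executed_injected: int, total_injected: int) -> float:
    """ASR: fraction of injected instructions the agent actually carried out."""
    return executed_injected / total_injected

def over_refusal_rate(benign_refused: int, benign_total: int) -> float:
    """FPR/ORR: fraction of legitimate requests the defense blocked."""
    return benign_refused / benign_total

def utility_preservation(completed_with_defense: int, completed_without_defense: int) -> float:
    """Utility relative to the undefended baseline (1.0 = no degradation)."""
    return completed_with_defense / completed_without_defense
```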

Limitations and future considerations include:

  • Ambiguous or multi-task intents: Mapping complex, ambiguous user goals to strict TaskContexts is nontrivial and may require advanced UIs or compositional policy templates (Cai et al., 30 Oct 2025).
  • Coverage of new attack modalities: Static policy libraries and supervised intent models may lag behind novel, highly obfuscated multimodal or cross-lingual attack patterns; adaptive red-teaming and continual learning loops are essential (Zhang et al., 20 Oct 2025).
  • Model-dependency: Certain interventions (IIA, CoT) rely on reasoning-capable LLMs and may underperform on models lacking explicit chain-of-thought competence (Kang et al., 30 Nov 2025).
  • Usability and extension: Task-centric permissioning was shown not to disrupt user experience mid-task, but ongoing trade-off analysis is necessary as agent interaction complexity grows (Cai et al., 30 Oct 2025).

Potential extensions involve integrating intent analysis at model pre-training or RLHF stages, extending to multi-agent or multi-device orchestrations, and incorporating human-in-the-loop for low-confidence or high-variance cases.

7. Comparison of Architectures and Empirical Results

| Defense Mechanism | Attack Block Rate (ASR↓) | Over-refusal Rate (FPR/ORR) | Utility Preservation | Key Features |
|---|---|---|---|---|
| Task-centric IntentGuard (Cai et al., 30 Oct 2025) | 100% (synthetic exfiltration) | 0% | Full completion | Per-task, revocable policy |
| LLM Intent Analysis (IIA) (Kang et al., 30 Nov 2025) | <10% (adaptive attacks) | 0% | Utility ≈ vanilla | Instruction origin tracing |
| IntentionReasoner (Shen et al., 27 Aug 2025) | 1.2–1.5% (auto-jailbreak) | ≈0% | SOTA F1, utility | Multi-level labels, query rewriting |
| Cryptographic Prompt Guard (Chan, 29 Mar 2025) | 100% against out-of-permission API calls | 0% | Not task aware | Cryptographic enforcement |
| InjecGuard (MOF) (Li et al., 30 Oct 2024) | SOTA (explicit injection) | 87.3% benign accuracy on NotInject | High | Data-driven over-defense mitigation |
| Multimodal CrossGuard (Zhang et al., 20 Oct 2025) | 2.79% (all), 5.39% (implicit) | Not reported | UPR > 90% | Joint-modal intent embeddings |

Systems are selected, tuned, and evaluated according to dataset, threat model, and application environment. The results highlight the principle that least-privilege, intent-aligned, and model- or cryptographically-enforced restrictions robustly contain both direct and subtle emergent attacks, without incurring significant usability costs.


IntentGuard thus reconceptualizes agent security from broad, session-centric privilege to task-aligned, just-in-time action—and, in higher-order frameworks, from syntactic prompt sanitization to model-internal semantic intent tracing. This shift addresses critical vulnerabilities in both legacy and modern AI deployments and establishes the foundation for safe, intent-aligned deployment of autonomous systems across domains (Cai et al., 30 Oct 2025, Kang et al., 30 Nov 2025).
