Papers
Topics
Authors
Recent
Search
2000 character limit reached

AI Agents May Always Fall for Prompt Injections

Published 17 May 2026 in cs.CR, cs.CL, and cs.CY | (2605.17634v1)

Abstract: Prompt injection is the most critical vulnerability in deployed AI agents. Despite recent progress, we show that the prevailing defense paradigm (data-instruction separation) both fails to detect attacks that operate through contextual manipulation and degrades contextually appropriate behavior. We then recast prompt injection via the lens of Contextual Integrity (CI), a privacy theory that judges information flow compliance with contextual norms. This explains types of attacks that current defenses attempt to patch and predict advanced ones future agents will face. We develop unique benign and attack scenarios that force an agent to violate the norms by (1) misrepresenting the flow, (2) manipulating norms, or (3) mixing multiple flows. This reframing suggests an impossibility result: an adversary can always construct a context under which a blocked flow appears legitimate, or a defender who tightens norms will block genuinely legitimate flows. Our findings suggest that current research addresses a shrinking fraction of future attack surfaces. Instead, through CI, we offer a principled framework for evaluating context-sensitive failures, and designing CI-aware alignment for the frontier autonomous agents.

Summary

  • The paper demonstrates that AI agents remain vulnerable to context-based prompt injections despite conventional defenses based on data–instruction separation.
  • Empirical evaluations reveal that standard injection classifiers perform near chance (AUROC 0.43–0.59), compromising both security and utility.
  • The authors present an impossibility argument showing that static defense strategies necessarily trade off between blocking attacks and enabling legitimate actions.

Contextual Integrity and the Persistent Challenge of Prompt Injection in AI Agents

Introduction

The paper "AI Agents May Always Fall for Prompt Injections" (2605.17634) presents a critical reevaluation of prompt injection vulnerability in autonomous language-model-driven agents. It demonstrates that the prevailing defense paradigm—data-instruction separation—fails to prevent context-based attacks and can degrade the agent's ability to perform contextually appropriate actions. Leveraging the theory of Contextual Integrity (CI), the authors systematically analyze the failure modes of existing approaches, introduce an impossibility argument regarding defense completeness, and provide empirical evidence corroborating their claims. The implications are significant for the design, training, and evaluation of future AI agents, particularly those operating in high-stake, multi-party contexts. Figure 1

Figure 1: A schematic depicting the limitations of data/code boundary defense strategies versus a CI-based decomposition that enables dimensional, context-sensitive probing and defense.

Critique of Data-Instruction Separation Paradigm

The canonical defense against prompt injection assumes a discernible boundary between "data" and "instructions," aiming to prevent model actions driven by untrusted context. The authors argue that this model is fundamentally flawed for agentic LLMs, which must interpret every interaction—emails, tool outputs, memory logs—as both data and potential instruction. As LLM agents increasingly operate in open domains with ambiguous or multi-source context, isolating instructions becomes neither semantically meaningful nor practically sufficient.

Experimental evaluation using a paired email dataset (4,200 pairs) illustrates that injection classifiers perform at near-chance (AUROC 0.43–0.59) in distinguishing benign from contextually manipulative emails, as both variants share high surface similarity and lack overt injection tokens. Figure 2

Figure 2: (a) Classifiers fail to distinguish near-identical emails differing only in implied authority, (c) safety fine-tuning (SecAlign) increases unauthorized send rates, and (d) suppresses the model’s ability to report true violations.

Notably, enforcement of stricter data/code barriers not only fails to block context-based attacks, but also reduces utility by suppressing legitimate actions (see the degradation in contextual appropriateness after safety-oriented finetuning).

Contextual Integrity: Theoretical Framework

CI posits that the appropriateness of information flows is defined by compliance with entrenched contextual norms. For language agents, this reframes the problem: actions are not authorized or unauthorized based on syntactic cues, but through multi-dimensional contextual relationships, including sender/receiver identity, information type, subject, and transmission principle (delegation).

A key argument is that attacks and legitimate actions often differ solely by values for these parameters, which the agent must infer from possibly ambiguous or manipulated context. If the agent over-restricts, it blocks legitimate flows; if it over-permits, it succumbs to sophisticated prompt injections.

Empirical Evaluation: Red-Teaming and Norm Failures

The authors introduce a CI-grounded red-teaming methodology, demonstrating the inadequacy of both static classifiers and safety-finetuned models against context-based attacks and value manipulation. Figure 3

Figure 3: Mapping of ConVerse-style agent-to-agent attacks onto the CI decomposition, showing how privacy and security violations manifest across different CI dimensions.

Figure 4

Figure 4: Iterative attacks that fabricate delegation authority (left) or use persuasive arguments to manipulate norm evaluation (right), achieving high attack success rates.

An adaptive iterative attacker achieves a 96.7% success rate in eliciting unauthorized send actions by progressively crafting emails that embed subtle, plausible delegation claims—specifically by forging first-party user quotes in forwarded threads. Notably, these attacks transfer across model families: unoptimized attacks created for claude-sonnet-4-6 succeed on gpt-5.2/5.4 and gemini-3-pro at rates of 63–93%. This transferability highlights the depth of the issue, not just model idiosyncrasy.

On the value manipulation front, even when context parameter manipulation is forbidden, attacks that exploit cost asymmetries, welfare appeals, and role normalization achieve up to 79% success against gpt-5.2. Figure 5

Figure 5: Before and after red-team attack: the agent’s reasoning shifts from correctly identifying authorization as third-party to wrongly interpreting a fabricated quote as genuine first-party instruction.

Generalization Failures: History and Flow Separation

Experiments further reveal that even absent adversarial manipulation, agents overgeneralize delegation when deprived of adequate grounding history. Execution rates for out-of-scope requests—with no prior history—are 30–36%. This drops substantially (to as low as 7%) once agents accumulate in-scope history, underscoring the role of interaction history in norm inference. Figure 6

Figure 6: (c) Over-compliance in the absence of grounding history, and (d) leakage between authorized and unauthorized flows in a single thread.

When multiple information flows—subject to different delegations—are embedded in a single thread, agents often collapse their boundaries: in explicit tests, gpt-5.2 improperly executes unauthorized flows in 65% of scenarios unless explicit boundaries are reiterated.

The Impossibility Argument

The core formal claim is that no static policy or defense can guarantee perfect context-based prompt injection resistance while maintaining utility. For each parameter, an adversary can construct a plausible context (e.g., fabricating a prior delegated reply) that, if genuine, would make the action normatively appropriate. Defenders cannot universally verify the authenticity of context at inference time. Efforts to block all suspicious claims block genuine flows; efforts to allow all plausible claims permit attacks. The tradeoff is irreducible without external, often impractical, attestation or verification.

Implications and Future Directions

The implications are consequential for the design of autonomous agents:

  • Training and Red-Teaming: CI-grounded scenario pairs, beyond surface-form injection benchmarks, are necessary to train agents' context-sensitive reasoning. Reward models and RL pipelines must factor norms not derivable from static dataset pairs.
  • Defense Architecture: System-level verification (e.g., digital authentication of delegation) must complement agentic reasoning. No model-only solution suffices; layered approaches are required.
  • Security-Utility Tradeoff: Evaluation of agent security cannot be separated from utility; actions to improve resistance to context-based attacks (e.g., over-restrictive boundaries) invariably degrade helpfulness in real deployments.

The CI lens offers a categorical structure for summarizing, benchmarking, and red-teaming failure modes. This is especially important as agents increasingly participate in multi-agent, long-horizon, cross-context scenarios (e.g., workflow orchestration, inter-department communications) where delegation, context, and flows are ambiguous, intertwined, and frequently unverifiable.

Conclusion

This work conclusively demonstrates that prompt injection—under a CI-theoretic and context-manipulation perspective—remains an unavoidable vulnerability for current and foreseeable LLM agents. Empirical results reveal fundamental limitations in existing detection and alignment paradigms, highlight the necessity of dynamic, context-sensitive evaluation frameworks, and provide an impossibility result for the absolute prevention of prompt injection without sacrificing utility. The research advocates for red-teaming and defense strategies that are theory-led, adaptable, and contextually grounded, establishing a new baseline for both practical deployment and scientific inquiry in secure, trustworthy autonomous agents.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

Overview

This paper asks a simple but important question: can we ever make AI “agents” (like email helpers or web assistants) fully safe from being tricked by bad messages? The authors argue that today’s main way of defending these agents—trying to separate “data” from “instructions”—doesn’t really work in the real world. Instead, they suggest a new way to think about safety based on “Contextual Integrity,” which basically means judging whether an action is okay by looking at the whole situation, not just the words.

The big questions the paper explores

  • Can AI agents reliably tell when a message is trying to trick them if the trick is hidden in the situation, not just in the words?
  • Do current defenses (like classifiers that spot “prompt injections” or safety-tuned models) help, or can they actually make things worse?
  • Is there a better way to think about the problem that fits real-life use, where what’s appropriate depends on context?
  • Are there cases where it’s impossible to block every bad action without also blocking some good, necessary actions?

How the researchers studied it (in everyday terms)

Think of an AI email assistant that can read emails and sometimes send replies for you.

  • They built “paired” emails: two emails that look almost the same, except one has a tiny extra line like “this has already been approved,” while the other keeps things neutral. There’s no obvious “ignore all instructions” kind of trick—just a subtle claim about what’s allowed.
  • They tested two kinds of user settings:
    • Restrictive: the user only authorizes reading/summarizing, not sending emails.
    • Permissive: the user authorizes the assistant to take actions like sending emails.
  • They checked if popular “prompt injection detectors” could tell the difference between harmless vs. sneaky emails that make authority claims.
  • They compared a regular model and a safety-tuned model to see how often they send emails when they shouldn’t, and how well they flag suspicious content.
  • They ran a “red team” attack loop, where another AI tries to trick the assistant by crafting emails that look legitimate, like forwarding a supposed message from the user or quoting a fake approval.
  • They studied two common real-world challenges:
    • History grounding: Does the assistant behave better when it has a bit of interaction history that shows what’s in scope vs. out of scope?
    • Mixed flows: What happens when a single email thread includes two requests—one that’s allowed and another that isn’t? Will permission for one “leak” into the other?

Key idea explained: Contextual Integrity (CI)

  • CI says: whether sharing info or taking an action is okay depends on the situation—who said it, who it’s for, whose info it is, what kind it is, and under what rules or permissions.
  • The authors use CI to break down the problem into these parts and show how attackers can twist each one.

What they found and why it matters

  • Prompt injection detectors miss context-based tricks.
    • When the “attack” is in the situation (like a subtle claim of approval) instead of in obvious attack words, classifiers performed near chance. That means they couldn’t reliably tell sneaky from normal emails.
  • Safety-tuned models can overreact—and make things worse.
    • In restrictive mode (no sending), the safety-tuned model actually sent more emails than the regular one. It also flagged fewer truly suspicious cases when asked to report them. So “safety training” can sometimes reduce both safety and usefulness.
  • CI-based attacks work very well.
    • A red-team AI that tweaks context (like forging a quoted “from the user” message in a forwarded email) got the assistant to send emails in 96.7% of cases—even when sending was not authorized.
    • These attacks didn’t rely on weird suffixes or obvious “ignore all instructions” tricks. They looked legitimate and transferred to different AI models fairly well.
  • Agents need grounding history to draw boundaries.
    • Without a few examples of what’s in-scope, agents accepted out-of-scope requests much more often. With short interaction history, they got stricter and safer while still staying useful.
  • Mixed requests cause “permission leakage.”
    • If an email includes two requests—one allowed (like updating an internal tracker) and one not allowed (like emailing an external contact)—agents often mixed them up and did both, especially if no explicit “boundary rules” were stated.
  • Impossibility tradeoff.
    • The authors argue there’s a built-in tradeoff: if you always block flows that might be dangerous, you will sometimes block legitimate ones too. If you allow flows that look legitimate, a smart attacker can make a bad flow look legitimate. In other words, you can’t write a one-size-fits-all “never do X” rule that works in every situation.

Why this is important

  • Real agents work in messy situations. They read emails, browse the web, and use tools. In these settings, there’s no clear line between “data” and “instruction”—because context itself is instructive.
  • Attackers don’t need to say “ignore your rules.” They can make the situation look like the right rules already allow the action.
  • If we only block obvious attack words, we’ll miss the more realistic tricks.

What this means going forward

  • Use Contextual Integrity to guide defenses. Instead of looking just for dangerous phrases, agents should check the situation: who’s asking, who’s affected, what’s the action, and under what permission.
  • Layered defenses are needed.
    • Model-level reasoning: Teach agents to reason about context and norms (e.g., “Do I really have permission to do this?”).
    • System-level verification: Check claims against ground truth when possible (e.g., signed senders, policy databases, user confirmation).
  • Red-team for context, not just keywords. Tests should try to manipulate the situation, not only inject obvious “ignore instructions” lines.
  • Build and use history carefully. Short, clean interaction history helps the agent learn what’s in-scope—though it also means we need to protect that history from being corrupted.
  • Expect tradeoffs. Perfectly safe and perfectly helpful at all times may not be possible. The goal is to make smarter, context-aware judgments and verify what can be verified.

In short: the paper shows that many future attacks won’t look like simple “hacks in text.” They’ll look like normal messages that quietly change the situation. Defending against that means teaching AI agents to understand context, verify claims when possible, and be trained and tested with those real-life complexities in mind.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concise list of what remains missing, uncertain, or unexplored in the paper, framed as concrete and actionable items for future research.

  • Formalize the “impossibility” claim: provide a rigorous theorem that specifies assumptions, attack/defense models, and minimal verification primitives needed to bound risk; quantify the utility–security tradeoff curve under those assumptions.
  • Define measurable CI parameter extraction tasks: operationalize sender/receiver/subject/information type/transmission principle extraction with gold labels; report precision/recall and confusion among parameters across models.
  • Build CI-grounded benchmarks beyond email: extend evaluations to web agents, code execution environments, databases/CRMs, and physical-world tools (e.g., scheduling, finance, deployment), including irreversible action settings with risk-weighted metrics.
  • Evaluate multi-agent scenarios empirically: move from illustrative mappings to controlled experiments with agent-to-agent communication, negotiated tasks, and coalition behaviors, measuring CI violations across agents.
  • Test long-horizon and memory-poisoning settings: study how early-context anchoring and memory corruption alter inferred norms over days/weeks; measure persistence and exploitability of poisoned norms.
  • Cross-lingual and cross-cultural norms: assess CI parameter inference and norm judgments across languages and cultural/organizational contexts; quantify variation and disagreement and how to calibrate to local norms.
  • Human-in-the-loop validation: complement LLM judges with expert human adjudication of CI violations and delegation boundaries to establish ground truth and inter-rater reliability.
  • Broaden and standardize baselines: compare more prompt-injection detectors (beyond Prompt Guard variants) and safety-tuning methods; report PR curves, calibration, and deployment-relevant thresholds, not just AUROC.
  • Control for safety-training confounders: isolate which training choices (data, objectives, prompts) cause degraded engagement/utility; replicate across multiple “secure” LLMs rather than a single SecAlign instance.
  • Richer task utility and safety metrics: move beyond “send_email” to include task completion rates, latency, user satisfaction, and cost of false positives/negatives under different risk budgets.
  • Flow-separation algorithms: propose and evaluate concrete methods for parsing and isolating simultaneous information flows within a message/thread (e.g., structured parsing, role attribution, per-flow gating).
  • Grounding-history mechanisms: design and test protocols to collect, summarize, and verify a user’s delegation history safely (e.g., signed delegation updates, provenance-tagged memory), and quantify how much history is needed to reduce over-compliance.
  • Provenance and attestation integrations: prototype cryptographic sender verification, policy attestation, and tamper-evident logs; evaluate coverage gaps, latency, and residual attack surfaces when verification is partial.
  • Norm-evaluation modeling: operationalize the CI evaluative heuristic (interests/values/context functions) into explicit criteria, rubrics, or reward models that can be trained (e.g., RL from CI-derived rewards), and test consistency across scenarios.
  • Defense-layer architectures: implement the proposed layered approach (model-level CI reasoning + system-level verification) end-to-end; measure failure modes, overhead, and how often verification overturns model judgments.
  • Red-team generalization: assess whether CI-crafted attacks transfer across model families, tasks, and tools; characterize which CI-parameter manipulations are most universally effective and why.
  • Realistic organizational boundaries: replace simple domain-based “internal vs external” heuristics with directory/graph-based org models; evaluate whether flow separation and boundary inference improve under richer structure.
  • Data realism and release: ensure the paired-email and multi-turn datasets reflect real organizational communications (style, metadata, headers, threading); release full corpora, prompts, and labels to enable reproducibility.
  • Adversarial constraints compliance: enforce and audit constraints in “norm-evaluation-only” attacks to isolate pure persuasive strategies; quantify leakage into parameter manipulation and refine attack generators accordingly.
  • Economic risk modeling: couple CI violations to downstream harm/cost models (e.g., financial loss, privacy penalties) to prioritize defenses where the expected harm is highest.
  • Adaptive delegation elicitation: design UX/protocols to elicit precise, evolving user delegation (e.g., granular policies, contextual confirmations) without overburdening users; quantify effects on utility and security.
  • Recovery and rollback: study detection and remediation when norms are hijacked (e.g., norm reset, provenance backtracking, memory scrubbing); measure recovery time and residual risk.
  • Transparency and explanation: develop methods for agents to justify CI parameter inference and norm judgments (per-action rationales), enabling user oversight and post-hoc auditing.
  • Benchmarking protocol for CI: propose standardized tasks, scoring, and reporting for CI-based evaluation (parameter extraction, legitimacy assessment, flow separation, history grounding), enabling comparable progress tracking.
  • Interaction with other defenses: empirically assess compatibility and composition with sandboxing, tool isolation, retrieval hardening, and semantic firewalls; identify synergistic and antagonistic effects.
  • Scalability and latency: quantify compute and latency overhead of CI-aware reasoning, verification, and red-teaming in production pipelines; explore caching and incremental verification to keep systems responsive.

Practical Applications

Immediate Applications

The paper’s CI-based reframing of prompt injection, empirical findings, and datasets enable concrete changes to how industry and researchers evaluate, design, and deploy agentic systems today. The following use cases can be deployed now with existing tooling and modest engineering effort.

  • CI-grounded red-teaming and QA for agent products
    • Sectors: software, enterprise productivity, security, finance, healthcare, education
    • What to do: Incorporate CI-parameter attacks (sender, receiver, subject, information type, transmission principle) and context-manipulation scenarios into release QA and internal red-teaming for agents (email assistants, CRMs, coding agents, helpdesk bots). Use pairwise scenarios that differ only in delegation/authority to test contextual appropriateness and detect over/under-compliance.
    • Tools/workflows: A “CI-RedTeam” harness that generates/curates emails and multi-turn tasks with paired delegation contexts; automated judges that tag which CI parameter was compromised; metrics that report utility vs. contextual-appropriateness error.
    • Assumptions/dependencies: Availability of CI-style scenario templates/datasets; ability to integrate model-in-the-loop judges; engineering bandwidth for test automation.
  • Layered enforcement: system-level provenance checks before agent action
    • Sectors: enterprise email, finance operations, procurement, healthcare scheduling/EHR, HR
    • What to do: Gate actions on verifiable provenance rather than model-only judgment—e.g., require verified sender identity (SPF/DKIM/DMARC, S/MIME), domain-based trust tiers, and structured approvals before tool calls (send_email, ticket creation, payments).
    • Tools/workflows: “Provenance Gate” middleware that checks cryptographic signatures, domain allowlists, and policy-backed approval tokens; configurable escalation to human review for unverifiable claims.
    • Assumptions/dependencies: Organizational adoption of email signing/DMARC policies, identity directories; integration with MDM/IdP and ticketing/CRM; acceptance of added friction for safety-critical actions.
  • Flow separation in agent pipelines (avoid authorization leakage)
    • Sectors: enterprise productivity, legal, finance, healthcare
    • What to do: Parse each incoming message to identify simultaneous flows (e.g., internal task + external commitment) and evaluate them independently; ensure that authorization for F1 does not implicitly carry over to F2.
    • Tools/workflows: A “Flow Separator” component that segments requests, tags internal vs. external recipients via domain heuristics, and enforces per-flow policy; separate tool invocations for each flow; explicit prompts/instructions that prohibit cross-flow inference.
    • Assumptions/dependencies: Reliable email/domain metadata; well-scoped tool routing; monitoring for false positives/negatives in flow detection.
  • Delegation-scoped UI and policy prompts for enterprise copilots
    • Sectors: enterprise software, contact centers, sales/marketing ops
    • What to do: Introduce explicit, task-bounded delegation toggles (“read-only,” “internal actions only,” “external comms with approval”) and reflect them consistently in system prompts and orchestration.
    • Tools/workflows: “Delegation Manager” UI and policy engine that captures user intent; prompt templates that embed the current transmission principle; approval workflows when requests exceed scope.
    • Assumptions/dependencies: Product/UX changes; organizational clarity on role-based permissions; user willingness to set scope upfront.
  • History-aware decision-making to improve norm grounding
    • Sectors: customer support, project management, procurement
    • What to do: Retain and surface relevant interaction history to ground what’s in-scope, reducing over-compliance with out-of-scope requests; display history snippets that justify decisions.
    • Tools/workflows: “Memory Grounder” that retrieves prior delegations, approvals, and similar in-scope tasks to inform current decisions; caching/context retrieval that is safe and auditable.
    • Assumptions/dependencies: Storage and retrieval infrastructure; privacy controls for history data; guardrails to prevent memory corruption.
  • CI-aware alerting and audit trails for appropriateness decisions
    • Sectors: compliance, risk, security, regulated industries (finance, healthcare, education)
    • What to do: Log the inferred CI parameters and the rationale for each action; emit SECURITY ALERTs or raise tickets for suspicious context manipulation (e.g., fabricated approvals, inconsistent senders).
    • Tools/workflows: “CI Audit Log” schema with sender/receiver/subject/type/transmission principle tags; SOC integrations; dashboards for violation vs. utility trade-offs.
    • Assumptions/dependencies: Centralized logging; privacy safeguards; clear escalation/response playbooks.
  • Training and evaluation updates to avoid over-restrictive safety tuning
    • Sectors: model providers, MLOps, applied research
    • What to do: Add context-sensitive, paired scenarios (same surface form, different delegation) to training/eval to reduce the risk that alignment suppresses legitimate engagement (e.g., flagging suspicious content) while still preventing unauthorized actions.
    • Tools/workflows: “CI Pair Curriculum” for fine-tuning; evaluation suites that measure both utility and contextual appropriateness; model selection that balances trade-offs.
    • Assumptions/dependencies: Availability of labeled pairs; careful objective design to avoid utility collapse; compute for training/eval.
  • Sector-tailored policy gating
    • Healthcare: Gate disclosures of PHI to external addresses unless patient consent or provider authorization is verifiable; prompt agents to verify institutional policy references before acting.
    • Finance: Require verified trader/manager sign-off for orders/commitments; separate internal bookkeeping from external confirmations; block “standard protocol” claims without documentary evidence.
    • Education: Enforce FERPA-like norms for disclosing student information to third parties; separate grading/internal tasks from outbound communication.
    • Assumptions/dependencies: Machine-readable policies; identity and consent records; integration with EHRs, OMS/EMS, SIS/LMS.
  • Vendor and procurement checklists for agent deployments
    • Sectors: enterprise IT, procurement, governance/risk/compliance
    • What to do: Add CI-grounded tests and provenance/flow-separation requirements to RFPs, acceptance criteria, and third-party risk assessments for agentic tools.
    • Tools/workflows: Standardized checklists; CI-based penetration tests; procurement gates conditioned on CI metrics.
    • Assumptions/dependencies: Organizational buy-in; vendor cooperation; baseline security posture.

Long-Term Applications

The paper’s impossibility argument and CI reframing point to deeper innovations that require further research, standardization, and ecosystem development.

  • CI-aware alignment and RL with CI-derived rewards
    • Sectors: model providers, academic research
    • What it enables: Training models to reason over CI parameters, separate simultaneous flows, and optimize for contextually appropriate actions under uncertainty, using paired scenarios and CI-based reward signals.
    • Tools/products: “CI-RL” pipelines; datasets spanning sectors and cultures; evaluation harnesses for utility–safety trade-off curves.
    • Dependencies: Scalable, diverse CI benchmarks; robust reward modeling; techniques to avoid specification gaming.
  • Semantic firewalls and contextual security policy engines
    • Sectors: enterprise platforms, security vendors, multi-agent systems
    • What it enables: Runtime enforcement that projects each message onto a task-defined CI context and only permits actions consistent with verified norms; unifies privacy and security under CI.
    • Tools/products: “CI Policy Engine” and “Semantic Firewall” integrated with agent frameworks (tool routers, planners, communicators).
    • Dependencies: High-precision CI extractors; fast provenance checks; low-latency policy decision points.
  • Organization-wide context verification and attestations
    • Sectors: email, collaboration, cross-org workflows, supply chain
    • What it enables: Standardized, verifiable proofs of authority (e.g., digitally signed approvals, delegation tokens, time-bound scopes) that agents can check automatically across organizations.
    • Tools/products: “Delegation Tokens” (structured, signed claims); cross-org attestations; protocols/APIs for authority verification.
    • Dependencies: Standards and PKI adoption; federation across vendors; UX for human approvals; backward compatibility with legacy systems.
  • Secure agent memory and norm-anchoring defenses
    • Sectors: long-running agents, RPA, customer support, robotics
    • What it enables: Memory designs that are resilient to context-manipulation and early-interaction anchoring attacks; provenance-aware memory writes and reads; rollback and quarantine for corrupted memory.
    • Tools/products: “Memory Provenance Chains,” write policies, anomaly detectors for context drift.
    • Dependencies: Provenance labeling in agent frameworks; storage with integrity guarantees; monitoring and recovery tooling.
  • Skill and tool supply-chain security for agents
    • Sectors: software, devops, automation platforms
    • What it enables: Verification pipelines for “skill files” and tools (origin, integrity, scope) to prevent instruction-like behaviors embedded in capabilities; CI labeling for each tool’s permitted flows.
    • Tools/products: Tool registries with signed manifests and CI metadata; “Capability Gatekeepers” that enforce scope during composition.
    • Dependencies: Tool/skill ecosystem standards; signing and attestation; developer adoption.
  • Multi-agent CI governance and conversation controls
    • Sectors: marketplaces, B2B integrations, autonomous service networks
    • What it enables: CI-aware protocols for agent-to-agent communication where each message is evaluated against the sender’s and receiver’s roles and purposes; prevent privacy/security violations in negotiated tasks.
    • Tools/products: “CI Handshake” protocols; per-agent role certificates; negotiation guards.
    • Dependencies: Interoperable schemas for roles/purposes; shared policy registries; cross-vendor compliance.
  • Regulatory frameworks and certifications for agentic systems
    • Sectors: public policy, regulated industries
    • What it enables: CI-based safety standards, certification checklists, and audit requirements that reflect contextual appropriateness rather than string-pattern defenses; metrics for over- vs. under-compliance.
    • Tools/products: Compliance baselines; third-party certification programs; sector-specific profiles (HIPAA/FERPA/SOX-aligned CI tests).
    • Dependencies: Policymaker–industry collaboration; public benchmarks; capacity for audits.
  • Cultural and organizational norm personalization
    • Sectors: global enterprises, consumer assistants
    • What it enables: Models and policy layers that adapt CI norms to organizational policy and regional/cultural expectations, reducing false positives/negatives.
    • Tools/products: “Norm Profiles” per org/team; adaptive policy learning; explainable decisions for stakeholder trust.
    • Dependencies: Data collection on norms; governance for customization; privacy considerations.
  • Formal trade-off modeling and decision support for autonomy
    • Sectors: safety-critical automation (finance, healthcare, robotics)
    • What it enables: Quantitative frameworks and dashboards that make the over-/under-compliance trade-off explicit, with adjustable risk thresholds and human-in-the-loop intervention points.
    • Tools/products: “Appropriateness Risk Meters,” cost-of-inaction vs. cost-of-action estimators; policy simulators.
    • Dependencies: Risk models; historical incident data; integration with operations.
  • Education and workforce training on CI for AI deployments
    • Sectors: academia, professional training, enterprise enablement
    • What it enables: Curricula and workshops that teach CI reasoning, red-teaming, and system design for contextual safety; improves deployment literacy and reduces misuse.
    • Tools/products: Course modules, case libraries, hands-on labs with CI datasets.
    • Dependencies: Open educational resources; collaboration between universities and industry.

These applications collectively shift practice from brittle “data vs. instructions” filters to a CI-grounded approach that evaluates and enforces appropriateness given real-world context, while acknowledging the trade-offs and verification gaps surfaced by the paper’s impossibility result.

Glossary

  • Adversarial suffixes: Short adversarial strings appended to inputs to induce undesired model behavior, often optimized to transfer across prompts or models. "AutoInject learns transferable adversarial suffixes via reinforcement learning"
  • Agentic LLMs: LLMs that act as autonomous agents, making tool calls and decisions within tasks. "Evaluating and attacking agentic LLMs."
  • Alignment-based safety training: Fine-tuning approaches that align models with safety policies to resist manipulative inputs. "current alignment-based safety training can both degrade contextual appropriateness and reduce utility."
  • Attack surface: The set of potential points in a system where an attacker can attempt to exploit or manipulate behavior. "current research addresses a shrinking fraction of future attack surfaces."
  • AUROC: Area Under the Receiver Operating Characteristic curve; a metric for binary classifier performance. "perform at near-chance (AUROC 0.43--0.59)"
  • Black-box search: Optimization without access to a model’s internal parameters, using only input-output behavior. "AgentVigil uses black-box search to find payloads that survive agent pipelines."
  • Context hijacking: Manipulating or co-opting the execution context to make harmful actions appear legitimate. "framing context hijacking over user data as a CI violation."
  • Context manipulation: Altering contextual cues or claims (e.g., identity, policy) to influence an agent’s decisions. "perform at near-chance (AUROC 0.43--0.59) when attacks operate through context manipulation rather than injection vocabulary"
  • Contextual appropriateness: Whether an action is suitable given the norms and constraints of the surrounding context. "worsening contextual appropriateness."
  • Contextual Integrity (CI): A theory that judges the appropriateness of information flows by contextual norms (sender, receiver, subject, information type, transmission principle). "Contextual Integrity (CI) offers the practical lens for probing and grounding this broader class of failures."
  • Contextual security: Defense approaches that adapt protections based on the interpreted context of interactions. "Approaches like contextual security and semantic firewalls"
  • Cryptographic attestation: A cryptographic method to prove identity or integrity of data or senders. "by e.g., cryptographic attestation of sender identity"
  • Data/code boundary: A separation principle that treats input data as non-executable and enforces strict borders between content and instructions. "aims to enforce a data/code boundary."
  • Data-instruction separation: The defense paradigm that attempts to distinguish and isolate instructions from data to prevent prompt injection. "the prevailing defense paradigm (data-instruction separation)"
  • Delegation boundaries: Limits defining what actions an agent is authorized to perform on behalf of a user. "These rates indicate a general failure to reason about delegation boundaries"
  • Delegation scope: The precise range of tasks and actions the user authorizes the agent to execute. "the agent must infer its delegation scope."
  • Dual firewalls: A two-layer filtering approach to enforce contextual norms in agent-to-agent communication. "propose dual firewalls for agent-to-agent communication"
  • Entrenched norm: An established contextual rule that defines acceptable information flow or action. "A flow that conforms to the entrenched norm is appropriate"
  • Exfiltration: Unauthorized extraction or leakage of sensitive data to an external party. "exfiltrate a third party's data."
  • Flow separation: The practice of isolating and evaluating multiple concurrent information flows independently within a single message or thread. "flow separation as a challenge in norm enforcement"
  • Ground-truth labels: Verified correct labels used to evaluate or train models. "assume ground-truth labels for appropriateness"
  • Grounding history: Prior interaction history that helps an agent infer the applicable norms and delegation accurately. "Norm Inference Requires Grounding History"
  • Impossibility result: A theoretical claim showing that no fixed policy can block all attacks without also blocking some legitimate actions. "This reframing suggests an impossibility result"
  • Information-flow control: Mechanisms or policies that regulate how information moves between entities to enforce security/privacy properties. "derive information-flow control from CI"
  • Long-horizon tasks: Extended, multi-step activities requiring sustained autonomous operation and planning. "Agents running complex and long-horizon tasks"
  • Multi-agent systems: Environments where multiple agents interact and coordinate, potentially with distinct roles and objectives. "privacy and security of multi-agent systems"
  • Norm evaluation: The process of judging whether a contemplated action complies with contextual norms and is legitimate. "Norm evaluation and its vulnerability."
  • Norm grounding: Inferring and establishing which norms apply in a given context from available evidence and history. "A further challenge concerns norm grounding"
  • OWASP Top 10 for LLM Applications: A security risk taxonomy tailored to LLM-integrated systems. "OWASP Top 10 for LLM Applications"
  • Prima facie CI violation: An apparent violation of contextual integrity before deeper justification is considered. "constitutes a prima facie CI violation."
  • Privilege escalation: Gaining higher permissions or capabilities than authorized, often leading to destructive or sensitive actions. "privilege escalation (destructive, impersonating, or exfiltrating"
  • Provenance: The verifiable origin or history of data or messages used to assess authenticity and trust. "the absence of verifiable provenance."
  • Red-teaming: Systematic adversarial testing to discover vulnerabilities through simulated attacks. "CI-grounded red-teaming should benefit privacy and security"
  • Semantic firewalls: Policy or model-based filters that evaluate meaning and intent to block unsafe actions or information flows. "semantic firewalls could dynamically protect an agent"
  • System-level isolation: Architectural separation that prevents external content from influencing sensitive components or actions. "using system-level isolation"
  • Transmission principle: The condition or authority under which an information flow or action is permitted (e.g., consent, policy). "the transmission principle (under what conditions the flow is authorized"
  • Trust boundaries: The inferred borders between trusted and untrusted parties or domains in communication. "infer trust boundaries from email domains."
  • User delegation: The explicit or implicit authorization a user grants an agent to act on their behalf. "User delegation determines whether sending is contextually appropriate."

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 3 tweets with 135 likes about this paper.