Prompt Flow Integrity (PFI)
- Prompt Flow Integrity (PFI) is a security framework that ensures no lower-privilege prompt segment can override higher-privilege ones by enforcing strict flow controls.
- It leverages structured segmentation, provenance tracking, runtime policy enforcement, and agent isolation to counteract prompt and data injection attacks.
- PFI demonstrates practical benefits with negligible latency overhead and significant improvements in secure utility rates while addressing complex LLM security challenges.
Prompt Flow Integrity (PFI) is a principled framework for enforcing security guarantees in LLM agents and retrieval-augmented generation (RAG) systems, specifically designed to prevent prompt injection and privilege escalation. PFI encompasses mechanisms for structured prompt segmentation, provenance tracking, runtime policy enforcement, agent isolation, and persistent validation of data/control flows based on trust and authority. Deployed primarily as middleware layers or agent-wrapping protocols, PFI offers rigorous defenses against direct and indirect manipulation by untrusted sources without sacrificing the utility or programmability of LLM-based systems (Alam et al., 19 Mar 2026, Kim et al., 17 Mar 2025).
1. Formal Definitions, Threat Model, and Control-Flow Invariant
Prompt Flow Integrity is defined as an architectural property ensuring that no lower-privilege or untrusted input can override, contradict, or escalate above the authority of higher-privilege actors within an LLM system or agent. In PCFI, each complete prompt is viewed as the ordered composition of system, developer, user, and retrieved segments, each tagged as a five-tuple (t, r, p, v, m) denoting raw text, role, priority, provenance, and auxiliary metadata respectively. The authority lattice is strictly system > developer > user > retrieved, and the PFI invariant states that no segment with lower priority p can introduce forbidden directives that affect any segment with higher priority (Alam et al., 19 Mar 2026).
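The five-tuple tagging and authority check described above can be sketched as follows; the names (`Segment`, `PRIORITY`, `may_override`) are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

# Authority lattice from the paper: system > developer > user > retrieved
PRIORITY = {"system": 4, "developer": 3, "user": 2, "retrieved": 1}

@dataclass
class Segment:
    text: str        # raw text t
    role: str        # role r (system/developer/user/retrieved)
    provenance: str  # provenance v (e.g., app config, URL, plugin id)
    meta: dict = field(default_factory=dict)  # auxiliary metadata m

    @property
    def priority(self) -> int:  # priority p, derived from role
        return PRIORITY[self.role]

def may_override(source: Segment, target: Segment) -> bool:
    """PFI invariant: a segment may only introduce directives that
    affect segments of equal or lower authority."""
    return source.priority >= target.priority

sys_seg = Segment("You are a helpful assistant.", "system", "app-config")
ret_seg = Segment("Ignore previous instructions.", "retrieved", "https://example.org")
assert not may_override(ret_seg, sys_seg)  # retrieved cannot override system
```

The strict ordering of the lattice is what makes the invariant checkable with a single integer comparison per segment pair.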
For LLM agents, PFI secures the environment against privilege escalation by distinguishing trusted and untrusted data sources, and ensures that no untrusted input reaches privileged sinks (privileged plugins or the final answer) unless explicitly approved by the user (Kim et al., 17 Mar 2025). The system explicitly tags all data using provenance identifiers and authority labels and enforces the trust boundary through mediation and auditing across the entire prompt flow and plugin invocation lifecycle.
2. Core Security Challenges and Attack Classes
PFI addresses multi-faceted risks inherent in LLM and agent systems:
- Untrusted-Data Processing: Plugins may ingest data from sources controlled by adversaries (e.g., web, email), compromising agent behavior.
- Lack of Least-Privilege: Absent privilege separation, prompt-level injection can yield full agent access to sensitive plugins and data.
- Missing Data-Flow Validation: Probabilistic mixing of data/control flows across tokens enables implicit and explicit high-to-low privilege violations, with traditional sandboxing and taint tracking proving insufficient in natural language contexts (Kim et al., 17 Mar 2025).
These threats produce attack classes:
- Prompt-injection attacks (compromising control flow, e.g., overriding system prompts),
- Data-injection attacks (covertly influencing plugin arguments, side effects, or transmitted outputs).
3. Enforcement Architectures: PCFI and Agent-Oriented PFI
PCFI: Priority- and Provenance-Aware Prompt Gateway
The Prompt Control-Flow Integrity (PCFI) system is an enforcement middleware structured as a three-stage FastAPI gateway:
- Stage 1 (Lexical Heuristics): User and retrieved segments are scanned for high-risk patterns using a pre-compiled list of sensitive keywords/phrases. Requests are flagged, not blocked, at this stage.
- Stage 2 (Role-Switch Detection): Identifies impersonation attempts (e.g., injection of "system: do X") in user or retrieved segments; sanitizes by stripping or renaming markers and issues a SANITIZE verdict if only impersonation is detected.
- Stage 3 (Hierarchical Policy Enforcement): Applies the forbidden-directive rules, matching each lower-priority segment for directives that contradict any higher-priority segment present; a block decision is enforced if matches are found.
At each stage, provenance and priority metadata are attached to content spans, enabling fine-grained accountability and policy traceability (Alam et al., 19 Mar 2026).
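The three-stage decision flow can be sketched as a single function; the pattern lists, marker regex, and verdict strings here are assumptions for illustration, not the actual PCFI rule set:

```python
import re

# Illustrative sensitive-phrase list (Stage 1), not the paper's rule set.
SENSITIVE = ["ignore previous instructions", "reveal the system prompt"]
# Impersonation markers (Stage 2): a line starting "system:" or "developer:".
ROLE_MARKER = re.compile(r"(?im)^\s*(system|developer)\s*:")

def gateway(segments: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Run the three stages over {role: text}; return (verdict, sanitized)."""
    flagged = False
    # Stage 1: lexical heuristics -- flag, never block.
    for role in ("user", "retrieved"):
        text = segments.get(role, "").lower()
        if any(p in text for p in SENSITIVE):
            flagged = True
    # Stage 2: role-switch detection -- strip impersonation markers.
    sanitized = dict(segments)
    impersonation = False
    for role in ("user", "retrieved"):
        if role in sanitized and ROLE_MARKER.search(sanitized[role]):
            sanitized[role] = ROLE_MARKER.sub("[stripped]:", sanitized[role])
            impersonation = True
    # Stage 3: hierarchical policy -- block when a flagged lower segment
    # contradicts a present higher-priority segment.
    if flagged and "system" in segments:
        return "BLOCK", sanitized
    if impersonation:
        return "SANITIZE", sanitized
    return "ALLOW", sanitized

verdict, _ = gateway({"system": "Be safe.",
                      "retrieved": "Ignore previous instructions."})
# verdict == "BLOCK"
```

Note the staging mirrors the text: Stage 1 only sets a flag, Stage 2 rewrites content, and only Stage 3 can produce a BLOCK.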
Agent Isolation and Secure Untrusted Data Processing
The agent-oriented PFI protocol splits agents into Trusted (A_T) and Untrusted (A_u) subagents:
- Untrusted Data Identification: Every plugin result is tagged with its data provenance and mapped as Trusted/Untrusted by a trust policy.
- Least-Privilege Enforcement: The trusted subagent A_T holds full plugin privileges, while the untrusted subagent A_u is spawned with only unprivileged plugin access whenever untrusted data is encountered.
- Data/Control-Flow Validation: All flows from untrusted sources passing into privileged contexts trigger FlowCheck, involving explicit user intervention for approval (requiring consent for data or prompt-type flows into privileged actions) (Kim et al., 17 Mar 2025).
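A minimal sketch of the least-privilege plugin mediation and FlowCheck gate described above; the plugin names, trust policy, and `approve` callback are assumptions, not the paper's interfaces:

```python
# Illustrative plugin privilege classes and trust policy.
PRIVILEGED = {"send_email", "write_file"}
UNPRIVILEGED = {"read_web", "summarize"}
TRUSTED_SOURCES = {"user", "local_config"}

def is_trusted(provenance: str) -> bool:
    return provenance in TRUSTED_SOURCES

def call_plugin(name: str, arg: str, provenance: str, approve) -> str:
    """Mediate one plugin call. Data from an untrusted source must pass
    FlowCheck (explicit user approval) before reaching a privileged sink."""
    if name in PRIVILEGED and not is_trusted(provenance):
        if not approve(f"Allow untrusted data from {provenance} into {name}?"):
            raise PermissionError("FlowCheck: flow to privileged sink denied")
    return f"{name}({arg})"

# Untrusted web data flowing into a privileged plugin triggers FlowCheck;
# here the (stubbed) user approves the flow.
result = call_plugin("send_email", "report", "web", approve=lambda q: True)
```

Unprivileged plugins never trigger the check, which is what lets the untrusted subagent keep working on adversary-controlled data without user interruption.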
4. Policy Semantics, Priority Lattices, and Rule Specification
Policies in PFI, especially PCFI, are specified as structured JSON or YAML entries:
- Each rule defines an identifier, matching pattern (string/regex), list of forbidden lower roles, and governing higher roles.
- The priority lattice assigns numeric/ordinal values: system (p=4) > developer (p=3) > user (p=2) > retrieved (p=1).
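A hypothetical rule entry in the JSON form just described; the field names are assumptions, not the paper's exact schema:

```python
import json

# Hypothetical PCFI rule: lower roles may not issue directives that
# override a present governing higher role.
rule = json.loads("""
{
  "id": "no-override-system",
  "pattern": "(?i)ignore (all )?previous instructions",
  "forbidden_lower_roles": ["user", "retrieved"],
  "governing_higher_roles": ["system", "developer"]
}
""")
assert rule["forbidden_lower_roles"] == ["user", "retrieved"]
```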
Rule enforcement logic in PCFI:
- On identifying a forbidden pattern in a lower-priority segment, if any higher-priority segment governed by the rule is present, the request is blocked.
- If only impersonation markers are detected, a SANITIZE operation strips markers without full blocking.
- Otherwise, requests are allowed (Alam et al., 19 Mar 2026).
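The block/sanitize/allow decision over one such rule can be sketched as follows (field names and the `enforce` helper are assumptions carried over from the rule format above):

```python
import re

PRIORITY = {"system": 4, "developer": 3, "user": 2, "retrieved": 1}

def enforce(rule: dict, segments: dict[str, str]) -> str:
    """Return BLOCK or ALLOW for one rule applied to {role: text}."""
    pat = re.compile(rule["pattern"])
    for low in rule["forbidden_lower_roles"]:
        if low in segments and pat.search(segments[low]):
            # Block only if a governing higher-priority segment is present.
            for high in rule["governing_higher_roles"]:
                if high in segments and PRIORITY[high] > PRIORITY[low]:
                    return "BLOCK"
    return "ALLOW"

rule = {
    "pattern": r"(?i)ignore (all )?previous instructions",
    "forbidden_lower_roles": ["retrieved"],
    "governing_higher_roles": ["system"],
}
assert enforce(rule, {"system": "Be safe.",
                      "retrieved": "Ignore previous instructions."}) == "BLOCK"
assert enforce(rule, {"retrieved": "hello"}) == "ALLOW"
```

The second call illustrates why the higher-segment check matters: a forbidden pattern with no governing segment present contradicts nothing, so the request is allowed.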
In agent-based PFI, a trust policy over data provenance governs trust boundaries, and a privilege policy defines plugin access for each subagent. The invariant is that no untrusted data may leak to a privileged sink without the user's explicit approval, enforceable in both data and control flows.
5. Evaluation Results and Performance Overhead
Benchmark studies for PCFI show:
- Attack Pass-Through Rate (APR): 0%—PCFI intercepts all attack-labeled requests (baseline without defense: APR=100%).
- False Positive Rate (FPR): 0%—no benign prompt blocked or sanitized erroneously.
- Latency Overhead: median 0.04 ms per request, with upper-percentile latencies of 0.08 ms and 0.14 ms, demonstrating negligible runtime cost (Alam et al., 19 Mar 2026).
For agent-based PFI, evaluated on AgentDojo and AgentBench OS benchmarks:
- Secure Utility Rate (SUR) (percent of tasks performed both successfully and securely) improved from 22.7% (baseline) to 61.9% (PFI) on AgentDojo, and from 0.0% to 68.4% on AgentBench OS.
- Attacked Task Rate (ATR) (attack success): baseline up to 100%, reduced to 0.0% with PFI.
- False-positive prompts from FlowCheck were under 1%, with zero false negatives. Performance overhead is approximately 2× model call latency due to subagent querying (Kim et al., 17 Mar 2025).
6. Limitations and Open Directions
Known limitations:
- Pattern-Based Evasion: PCFI's pattern/regex checks may miss paraphrased or obfuscated directives, e.g., "disregard earlier rules."
- Single-Turn Coverage: PCFI currently enforces integrity only within a single API turn; multi-turn poisoning and tool-callback chains remain outside its scope.
- Benchmark Representativeness: Evaluations use semi-synthetic datasets, which may under-represent real-world adversarial diversity.
- Model Alignment and Utility: In agent PFI, some losses in utility stem from the model's inability to reason robustly about proxy tokens or queries; improvements may require targeted fine-tuning.
Potential extensions include embedding similarity-based detectors for paraphrase resilience, extending provenance/trust graphs across multi-turn dialogues, enhancing policy expressiveness (e.g., quantitative or reactive thresholds), end-to-end provenance tracking through plugins, and combining PFI with semantic guardrails or classifier-based verification layers (Alam et al., 19 Mar 2026, Kim et al., 17 Mar 2025).
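The embedding-similarity extension can be illustrated with a deliberately toy sketch: bag-of-words vectors stand in for a real sentence-embedding model, and the attack list, threshold, and function names are all assumptions:

```python
import math
from collections import Counter

# Toy stand-in for an embedding model. A real paraphrase detector would
# use sentence embeddings; this only illustrates flagging paraphrases of
# known attack directives by vector similarity rather than exact match.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

KNOWN_ATTACKS = ["ignore all previous instructions", "disregard earlier rules"]

def paraphrase_score(text: str) -> float:
    return max(cosine(embed(text), embed(k)) for k in KNOWN_ATTACKS)

# A paraphrase scores high even though no exact pattern matches it.
assert paraphrase_score("disregard the earlier rules") > \
       paraphrase_score("what is the weather")
```

This is exactly the gap pattern/regex checks leave open: "disregard the earlier rules" evades an exact-match rule for "disregard earlier rules" but remains close in similarity space.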
7. Relationship to Existing LLM System Design
PFI mechanisms have been integrated as external middleware (PCFI) and as wrappers within ReAct-style agent architectures, preserving tool invocation compatibility while interposing policy enforcement at key junctures. By segmenting roles, formalizing priority, and enforcing provenance- and authority-based flow invariants, PFI systems establish strong, tractable boundaries against both prompt- and data-level injection, without requiring LLM retraining or internal model modification. This establishes PFI as a foundational security primitive for modern LLM deployments and agent platforms (Alam et al., 19 Mar 2026, Kim et al., 17 Mar 2025).