Personalized Agent Security Bench
- The paper introduces a comprehensive framework that quantifies CIA vulnerabilities in personalized AI agents using adaptive, scenario-driven attack simulations.
- PASB integrates modular components such as scenario generation, toolchain integration, multi-turn interaction harness, and evaluation metrics to stress-test defenses.
- Empirical evaluations reveal that even state-of-the-art agents remain susceptible to significant confidentiality, integrity, and availability risks under realistic attack conditions.
A Personalized Agent Security Bench (PASB) is a comprehensive, scenario-driven, and attack-adaptive framework for quantifying, comparing, and stress-testing the security posture of personalized AI agents. It systematizes end-to-end evaluation, enabling security researchers to empirically characterize the confidentiality, integrity, and availability (CIA) vulnerabilities of agents that operate in complex and realistic deployment environments. PASB’s architecture accommodates black-box and gray-box settings and is extensible to both single and multi-agent workflows, integrating threat-specific metrics, scenario instantiation, and defense validation (Wang et al., 9 Feb 2026, &&&1&&&, Mukhopadhyay et al., 31 Dec 2025, Li et al., 12 Mar 2026, Sharma et al., 23 Jul 2025).
1. Formal Modeling of Personalized Agents and Threats
PASB is grounded in a precise formalization of the personalized agent environment. Consider a set of users , each associated with a private profile; the agent maintains a long-term memory , invokes an external toolchain parameterized by privilege levels, and operates via a backbone LLM and system prompt . The agent observes at each step : , where is the user query, is untrusted external content, is the last tool return, and is retrieved memory.
The attack surface is parameterized by a family of end-to-end attack tasks , where comprises all adversarially controllable injection channels—including prompt, tool return, memory, or untrusted external content (Wang et al., 9 Feb 2026, Mukhopadhyay et al., 31 Dec 2025). The PASB threat model typically assumes black-box access to the agent’s LLM, system prompt, and memory, but permits adaptive exploitation via any of these channels.
2. Framework Architecture and Component Modules
PASB is architected as a modular evaluation harness with four canonical subsystems (Wang et al., 9 Feb 2026):
- Scenario Generator: Instantiates parameterized attack environments (personalized profiles, memory, toolchain) and introduces “canary” artifacts (privileged secrets) to quantitatively gauge leakage.
- Toolchain Integration: Deploys stubbed or real tools, exposing controllable endpoints for attack vectors such as tool-return deception or hijacking.
- Interaction Harness: Drives long-horizon, multi-turn agent-adversary interactions. It supports causal or adaptive attack strategies and logs the agent’s entire response trajectory.
- Evaluation & Metrics Module: Adjudicates harm predicates (e.g., leakage, privilege misuse, persistence), computes attack success rate (ASR), coverage, time-to-breach, and other scenario-specific metrics.
The evaluation loop is typically driven by iterating over a suite of scenarios , for each of which repeated trials are executed with variable random seeds and adversarial strategies. Injection primitives may include direct prompt injection, indirect injection (e.g., via external web content), tool return deception, and long-term memory poisoning. The adversary’s injection decisions can be adaptively updated based on observed model outputs, enabling realistic simulation of multi-turn attacks (Wang et al., 9 Feb 2026, Pai et al., 7 Feb 2026).
3. Benchmark Design: Scenarios, Metrics, and Attack Primitives
PASB instantiates attack tasks reflecting realistic, high-value threat scenarios for personalized agents (Wang et al., 9 Feb 2026, Li et al., 12 Mar 2026, Sharma et al., 23 Jul 2025, Mukhopadhyay et al., 31 Dec 2025). Canonical categories, each mapped to CIA risk domains, include:
| Scenario | Attack Vector | Evaluated Risk |
|---|---|---|
| Indirect prompt injection | Web content | Confidentiality |
| Tool schema poisoning | Malformed schema | Integrity |
| Memory poisoning | Retrieval exploits | Confidentiality |
| Multi-agent coordination | Cascading attacks | All (esp. integrity) |
Evaluation metrics systematically cover:
- Attack Success Rate (ASR): —probability of compromise across scenarios.
- Time-to-Breach (): Expected number of interaction rounds until a successful compromise.
- Coverage (): Fraction of task classes or canary secrets for which compromise is achieved.
- Response Rate: Fraction of trials with privileged tool invocation (IPI experiments).
- STM/LTM Extract & Write Success Rates: For short- and long-term memory manipulation (Wang et al., 9 Feb 2026).
- Blast Radius (), Chain Length (), Cascading Impact Score (CIS), Compound Effect (): Especially crucial in multi-agent settings; is the total number of compromised agents, is the length of the longest infection chain, and quantifies collapse of cross-checks (Sharma et al., 23 Jul 2025).
PASB implementations incorporate both empirically observed and mathematically formalized scoring functions, e.g.,
where is empirically measured under adaptively optimized adversary strategies (Li et al., 12 Mar 2026).
4. Implementation, Automation, and Adaptation
PASB codebases offer fully automatable, reproducible pipelines, often with modular structure for easy extension (Wang et al., 9 Feb 2026, Pai et al., 7 Feb 2026). Key implementation features include:
- Agent-Oriented Wrappers: Abstract agent APIs (e.g., “send_message(input) → (r_t, \kappa_t)$”) allowing plug-and-play evaluation of new models.
- Scenario and Attack Libraries: Collections of scriptable scenarios and attack vectors, supporting parameterized evaluation across agent deployments, toolchains, or threat models.
- Automation Harness: Looping over scenarios, payloads, and agent configurations, logging detailed traces and metric outputs to CSV or database.
- Defense Instrumentation: Facilities for toggling baseline vs. hardened configurations, enabling quantitative trade-off analysis for mitigations such as delimiter-based IPI prevention, tool approval protocols, or memory access controls.
Transferability and extensibility are supported via configuration-driven corpus definitions, modular behavioral judges, and subclassable mutation and injection operators (Pai et al., 7 Feb 2026).
5. Case Studies and Empirical Insights
PASB has been applied to both local and service-based personalized AI agents. The OpenClaw case study provides detailed empirical demonstrations: under adversarial prompt or content injection, even state-of-the-art models (e.g., Llama-3.1-70B-Instruct, Qwen2.5-7B-Instruct, GPT-4o-mini) retain high residual ASR and memory-manipulation vulnerabilities, even when common defenses are activated (Wang et al., 9 Feb 2026). For example:
- IPI Simulation: Llama-3.1-70B featured 46% ASR (no defense), 21.5% (delimiter), and 14% (sandwich) in indirect prompt injection experiments.
- Memory Extraction: LTM extraction succeeded at rates as high as 62.5% for unprotected agents.
- Write Success Rates: Long-term memory was more susceptible to adversarial overwrite or poisoning compared with STM.
This suggests that even layered or multi-modal mitigations are insufficient to eliminate core failure modes in realistic deployments.
6. Comparative Frameworks and Integration
Several complementary approaches enrich PASB’s methodological toolkit:
- Evolutionary Security Evaluation: NAAMSE models security as a feedback-driven optimization problem, employing genetic prompt mutation, hierarchical corpus exploration, and asymmetric behavioral scoring. This supports continuous, adaptive benchmarking far beyond static corpora and enables rapid discovery of compounding vulnerabilities (Pai et al., 7 Feb 2026).
- Privacy-Focused Benchmarks: PrivacyBench establishes a protocol for embedding secrets and measuring leakage in retrieval-augmented agents. PASB extends it with richer threat modeling (adversarial probing, multi-turn scenario design), new metrics (e.g., confidentiality loss, integrity, and availability scores), and structural safeguards (e.g., context-aware retrieval filters) (Mukhopadhyay et al., 31 Dec 2025).
- Multi-Agent Cascading Risks: PASB generalizes to multi-agent systems by formalizing propagation probability, blast radius, and contagion amplification factors, as captured by the Agent Cascading Injection (ACI) model. PASB quantifies not only isolated agent compromise but also system-wide collapse of defense-in-depth (Sharma et al., 23 Jul 2025).
7. Policy Models, Standards Alignment, and Defense Validation
PASB facilitates empirical evaluation of policy models for delegation, privilege control, and risk-adaptive enforcement. Hybrid RBAC + Risk-Adaptive Access Control (RAdAC) schemes are instantiated, associating quantitative risk with each operation and enforcing end-to-end constraints over agent delegation chains (Li et al., 12 Mar 2026). The framework aligns with NIST’s Risk Management Framework (RMF SP 800-37), supporting continuous monitoring, layered defense validation, and residual risk reporting.
Defense validation is performed via side-by-side comparison of baseline and hardened configurations, enabling measurement of resilience gain, performance overhead, detection accuracy, and usability costs.
By unifying rigorous formalism, adaptive attack simulation, empirical scoring, and modular automation, PASB establishes a robust foundation for security benchmarking of personalized, tool-augmented, and multi-agent AI deployments. It is directly supported by and extensible from recent research and code releases (Wang et al., 9 Feb 2026, Pai et al., 7 Feb 2026, Mukhopadhyay et al., 31 Dec 2025, Li et al., 12 Mar 2026, Sharma et al., 23 Jul 2025).