Papers
Topics
Authors
Recent
Search
2000 character limit reached

InjecAgent Benchmark: Evaluating Agent Security

Updated 3 March 2026
  • InjecAgent Benchmark is a quantitative framework that rigorously evaluates multi-agent and tool-integrated LLM systems against prompt injection and cascading compromise threats.
  • It employs a formal threat model with metrics like Attack Success Rate, Blast Radius, and Propagation Chain Length to enable repeatable, stress-test protocols under adversarial conditions.
  • Benchmark findings highlight key vulnerabilities and trade-offs in LLM agent architectures, driving the development of adaptive defenses and zero-trust strategies.

InjecAgent Benchmark is a quantitatively rigorous framework for evaluating the security of multi-agent systems and tool-integrated LLM agents against prompt injection and cascading compromise scenarios. Conceived initially to address vulnerabilities in agent-mediated tool chains exposed to indirect prompt injection (IPI), InjecAgent provides formal methodology, metrics, and repeatable protocols for stress-testing both single- and multi-agent systems under adversarial conditions—including sophisticated Agent Cascading Injection (ACI) attacks and classical IPI threats. The benchmark encompasses both synthetic and real-world tool invocation scenarios, mapping results to established security risk taxonomies such as the OWASP Agentic AI Top 10, and supports data-driven comparison of architectures, agent backends, trust protocols, and mitigating defenses (Sharma et al., 23 Jul 2025, Zhan et al., 2024).

1. Formal Threat Model and Attack Scenarios

InjecAgent formalizes two primary security threats for agentic LLM systems:

  • Indirect Prompt Injection (IPI): An attacker crafts malicious input embedded in external tool outputs or content accessed by the agent, ultimately causing the agent to deviate from the benign user’s task and perform attacker-controlled actions (e.g., unauthorized fund transfers, data exfiltration) (Zhan et al., 2024). The formal workflow involves a user uu issuing IuI_u to agent LL, which calls user tool TuT_u; the response RTuR_{T_u} contains the attacker’s instruction IaI_a. If LL subsequently invokes attacker tool TaT_a as per IaI_a, the attack is successful.
  • Agent Cascading Injection (ACI): In multi-agent settings, a single compromised agent aca_c emits a polluted message mcm_c due to injected exploit ε\varepsilon, propagating across the agent network along directed trust edges T⊆A×AT \subseteq \mathcal{A} \times \mathcal{A} (Sharma et al., 23 Jul 2025). The adversary’s objective is to maximize the blast radius B=∣C(ε,ac)∣B = |C(\varepsilon, a_c)|, potentially leading to system-wide failure via compromised trust topologies.

Attack scenarios in InjecAgent include both direct-harm (immediate physical, financial, or system damage) and data-stealing (multi-step attacks for private information extraction and exfiltration).

2. Benchmark Design and Methodology

Single-Agent (IPI-Focused) Protocol

The benchmark evaluates agentic LLMs through a comprehensive suite:

  • Test Case Generation: 1,054 scenarios spanning 17 user-facing tools and 62 attacker-invocable tools. Each combines a realistic user instruction with an attacker-embedded payload in the tool response, reflecting diverse APIs (email, webscraping, calendar, banking, etc.) (Zhan et al., 2024).
  • Attack Variants: Both base injections and "enhanced" attacks with explicit hacking prompts ("Ignore prior instructions…") are deployed, along with multi-turn dialogue variants that prime the agent context (Chang et al., 26 Sep 2025).
  • Execution and Evaluation: Each case is run with the agent under test (prompted via ReAct or fine-tuned for tool use), recording whether the agent follows the malicious instruction and whether task utility is preserved.

Multi-Agent (ACI-Focused) Protocol

To capture cascades:

  • Network Modeling: Agents A={a1,...,aN}\mathcal{A} = \{a_1, ..., a_N\} are connected via a directed trust graph TT, parameterized by sanitized input probabilities pi,jp_{i,j}.
  • Injection Procedure: After baseline runs, the exploit ε\varepsilon is injected at agent aca_c. The system is instrumented to log propagation events, detection, and compromise states at each agent.
  • Scenarios: Canonical scenarios include chain-of-delegation, orchestrated peer-review, resource-access escalation, and misinformation propagation across heterogeneous agent backends and trust structures.

3. Quantitative Metrics and Scoring

Core Metrics

  • Attack Success Rate (ASR): ASR=#{successful injections}#{total attempts}×100%\mathrm{ASR} = \frac{\#\{\text{successful injections}\}}{\#\{\text{total attempts}\}} \times 100\% (Zhan et al., 2024, Chang et al., 26 Sep 2025). "ASR-valid" restricts the denominator to valid-format tool outputs.
  • Blast Radius (BB): Number of agents compromised during ACI propagation (Sharma et al., 23 Jul 2025).
  • Propagation Chain Length (LL): Longest trust path traversed by the exploit.
  • Amplification Factor (α\alpha): Average secondary spread per compromised agent; α>1\alpha>1 signals supercritical (epidemic) propagation.
  • Compound Effect (Γ\Gamma): Covariance of compromise among critical agent pairs, exposing orchestrated collapse.
  • Detection Delay (DD): Hops before a monitoring agent alerts.
  • Harm Severity (HH): Discrete scale ([0,5][0,5]) adapted from AgentHarm.

Aggregate Score

A normalized security score is computed as:

SecurityScore=100−[w1⋅CR+w2⋅(L/Lmax)+w3⋅(D/Dmax)+w4⋅(H/5)]⋅100\mathrm{SecurityScore} = 100 - [w_1 \cdot CR + w_2 \cdot (L/L_{max}) + w_3 \cdot (D/D_{max}) + w_4 \cdot (H/5)] \cdot 100

with per-scenario and per-system aggregation for leaderboard construction.

4. Empirical Findings and Security Implications

InjecAgent’s results establish several critical vulnerabilities:

  • High Failure Rates: Leading ReAct-prompted LLM agents exhibit ASR of 23–32% (base), doubling under enhanced hacking prompts (Zhan et al., 2024). Template-forged and multi-turn primed attacks further raise ASR to 46–52% across open-source models; fine-tuned agents perform markedly better but remain imperfect (Chang et al., 26 Sep 2025).
  • Model-Performance Paradox: Larger, more capable LLMs (e.g., Llama2-70B, GPT-4) are more susceptible, contradicting naive expectations that capability would yield resilience.
  • Cascading Compromise: In multi-agent stress tests, excessive implicit trust or deep trust chains (high BB, LL, α\alpha) expose systems to rapid, amplified compromise, emphasizing the necessity for zero-trust policies and independent, diversified verification agents (Sharma et al., 23 Jul 2025).
  • Defensive Limitations: Standard prompt detectors, delimiters, and prompt repetition strategies show limited efficacy. Some novel firewall-style defenses (Tool-Output Firewall/Sanitizer) can suppress ASR to <1%—but are bypassed by adaptive attacks using obfuscated encodings or rare Unicode payloads (Bhagwatkar et al., 6 Oct 2025).
  • Lack of Utility Measurement: Earlier versions failed to report benign task utility under attack, hindering true assessment of security/utility trade-offs; this is now recognized as a critical benchmark requirement (Bhagwatkar et al., 6 Oct 2025).

5. Extensions, Critiques, and Benchmark Evolution

  • Critique and Best Practices: Analyses identify weaknesses in static, template-based attacks, lack of task utility measurement, ambiguous payloads, and absence of adversarial diversity. Recommended remedies include incorporating utility-under-attack, randomized/dynamic tool scenarios, adaptive attacks (obfuscation, encoding, multimodality), partial credit for incomplete attacks, and more rigorous auditing of success predicates (Bhagwatkar et al., 6 Oct 2025).
  • Benchmark Influence: The design has inspired extensions such as SecureAgentBench for security-focused code generation (tracking functional correctness and residual vulnerability with structured PoC testing and static analysis) (Chen et al., 26 Sep 2025), as well as more dynamic and open-ended agent benchmarks (e.g., AgentDyn, AgentLAB), which push beyond static single-step IPI to cover multi-turn, long-horizon, and adaptive attack strategies (Li et al., 3 Feb 2026, Jiang et al., 18 Feb 2026).
  • State-of-the-Art Defenses: While simple "firewall" modules can saturate ASR on legacy InjecAgent constructs, recent work emphasizes the need for context-aware, sequence-based defenses, adaptive detectors leveraging structural and semantic cues, and attack–defense co-evolution (Bhagwatkar et al., 6 Oct 2025, Jiang et al., 18 Feb 2026).

6. Practical Impact and Deployment Guidance

  • Architectural Trade-Offs: Benchmarking results offer actionable insights into the security/performance frontier for agentic architectures. Trade-offs include stricter input validation (lower pi,jp_{i,j}, reduced throughput), limited fan-out for trust edges (constrained α\alpha, increased latency), and persona-locking for system prompts (mitigating class-3 manipulations at the expense of flexibility) (Sharma et al., 23 Jul 2025).
  • Restorative Defenses: Agent operators are advised to adopt domain-tuned scoring weights, enforce zero-trust policies wherever feasible, and deploy real-time anomaly watchers for deep trust pipelines. Empirical tracking of SecurityScore leaderboards can drive system hardening over time.
  • Benchmark Adoption: InjecAgent remains a foundational framework for standardized evaluation and comparison of agent security, serving as both a stress-test corpus for practitioner deployment and a reference point for the design of next-generation, context- and adversary-robust defense mechanisms.

7. Future Directions

The continuing evolution of agentic systems and adversarial techniques motivates broadening the benchmark along several axes:

  • Multi-turn and Long-Horizon: Incorporate dynamic, open-ended attack strategies mimicking real-world workflows and adaptive adversaries (Li et al., 3 Feb 2026, Jiang et al., 18 Feb 2026).
  • Graded and Partial Success: Move beyond binary ASR, reporting partial compromise and nuanced security–utility trade-off curves (Bhagwatkar et al., 6 Oct 2025).
  • Encoded and Multimodal Attacks: Systematically include attacks leveraging encoding, obfuscation, multimodal artifacts, and conversational priming (Chang et al., 26 Sep 2025, Liu et al., 1 Oct 2025).
  • Defense Module Registries: Support plug-and-play evaluation of automated defense modules and red-teamers, with cumulative leaderboards tracking progress against an expanding threat landscape (Jiang et al., 18 Feb 2026).
  • Integrated Security-by-Design: Drive research towards planners and tool-call orchestrators with baked-in integrity constraints and explicit task contracts, raising the bar against latent bridging and exploitation via attacker-controlled intermediate actions.

A plausible implication is that only continuous, adversary-aware expansion and adaptive benchmarking can ensure that agentic systems are evaluated and hardened against the practical realities of current and future injection threats.

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to InjecAgent Benchmark.