Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents (2510.22620v1)
Abstract: AI agents powered by LLMs are being deployed at scale, yet we lack a systematic understanding of how the choice of backbone LLM affects agent security. The non-deterministic sequential nature of AI agents complicates security modeling, while the integration of traditional software with AI components entangles novel LLM vulnerabilities with conventional security risks. Existing frameworks only partially address these challenges as they either capture specific vulnerabilities only or require modeling of complete agents. To address these limitations, we introduce threat snapshots: a framework that isolates specific states in an agent's execution flow where LLM vulnerabilities manifest, enabling the systematic identification and categorization of security risks that propagate from the LLM to the agent level. We apply this framework to construct the $\operatorname{b}3$ benchmark, a security benchmark based on 194331 unique crowdsourced adversarial attacks. We then evaluate 31 popular LLMs with it, revealing, among other insights, that enhanced reasoning capabilities improve security, while model size does not correlate with security. We release our benchmark, dataset, and evaluation code to facilitate widespread adoption by LLM providers and practitioners, offering guidance for agent developers and incentivizing model developers to prioritize backbone security improvements.
Explain it Like I'm 14
Breaking Agent Backbones: A simple explanation
What is this paper about?
This paper looks at how secure AI agents are, depending on which LLM they use as their “backbone.” An AI agent is like a smart assistant that reads instructions, thinks step by step, and uses tools (like web search or code execution) to get things done. The authors introduce a new way to test and compare how easily these agents can be tricked or attacked. They call this method “threat snapshots,” and they use it to build a security benchmark that compares 31 popular LLMs.
What questions are the researchers trying to answer?
The paper focuses on simple, practical questions:
- Which LLMs make AI agents more secure against attacks?
- What kinds of attacks are most likely to work against LLM-powered agents?
- Can we test security without having to model an entire complicated agent?
- Do features like “reasoning mode” or model size make a model more secure?
How did they study this?
The authors created a framework called “threat snapshots.” Think of each snapshot like pressing “pause” on an AI agent at a single step and zooming in on what the LLM sees and outputs at that moment. This helps them study the exact point where a trick or attack might work—without simulating the whole agent from start to finish.
They then:
- Defined what counts as an LLM-specific vulnerability (for example, when the model follows harmful instructions hidden inside a document it’s reading).
- Built 10 realistic threat snapshots based on common agent tasks (like coding help, tool use, reading documents, or structured outputs).
- Categorized attacks by how they are delivered and what they aim to achieve:
- Delivery: direct (the attacker talks to the model) or indirect (the attacker hides instructions inside files, web pages, or tool settings).
- Goals: stealing data, injecting content, manipulating decisions or behavior, causing denial-of-service, compromising tools/systems, or bypassing content policies.
- Collected a very large set of real attacks through a gamified challenge (like a security “game” for humans to try to break agents), gathering 194,331 unique attacks. They selected the strongest 210 attacks to build their benchmark.
- Scored each model by repeatedly testing how often attacks succeed, because LLMs can be non-deterministic (they don’t always give the exact same answer every time). They averaged scores and used statistical methods to estimate uncertainty.
To make the tests fair and informative, each threat snapshot had three levels:
- L1: Minimal defenses (a simple system prompt).
- L2: Stronger instructions and more normal/benign context.
- L3: Adds an “LLM-as-judge” defense that checks outputs for problems.
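Putting the steps and defense levels above together, here is a minimal sketch of how one threat snapshot might be scored against one attack. The helper names (call_llm, judge_llm, insert_attack, score_attack) and the exact scoring rule are illustrative assumptions, not the paper's actual code.

```python
import statistics

def evaluate_snapshot(call_llm, judge_llm, system_prompt, benign_context,
                      attack, insert_attack, score_attack, n_runs=5):
    """Score one attack against one threat snapshot.

    call_llm, judge_llm, insert_attack, and score_attack are placeholder
    callables standing in for the backbone LLM call, an optional
    LLM-as-judge defense, the attack-insertion function, and the
    attack-scoring function described above.
    """
    # Build the poisoned context: the attack payload is hidden inside the
    # benign context the agent would see at this paused step.
    poisoned_context = insert_attack(attack, benign_context)

    scores = []
    for _ in range(n_runs):  # repeat because LLM outputs vary run to run
        output = call_llm(system_prompt, poisoned_context)

        # L3-style defense: a judge model flags outputs it considers unsafe.
        if judge_llm is not None and judge_llm(output):
            scores.append(0.0)  # attack blocked by the judge
            continue

        # How well did this output achieve the attacker's goal? (0 to 1)
        scores.append(score_attack(output))

    # Average over runs; the benchmark then averages across attacks and snapshots.
    return statistics.mean(scores)
```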
What did they find, and why does it matter?
Here are the main results, explained simply:
- Reasoning helps security. Models that have “reasoning mode” (where they think more carefully before answering) were generally harder to attack. This suggests that deeper thinking can help models spot tricks or stick to instructions better.
- Bigger isn’t necessarily safer. Larger models did not consistently beat smaller ones in security. Size alone isn’t a reliable indicator of safety.
- Closed-weight systems often did better. Models from providers that add extra guardrails (closed systems) tended to be more secure than open-weight models. However, this also reflects that closed systems often include extra safety layers beyond the base model.
- Security is different from content safety. Some models excel at blocking unsafe content (like violent or toxic text) but aren’t the best at resisting manipulation attacks that change behavior or trigger tool calls. So you can’t judge overall security just by content filters.
- The best models stayed strong across different defenses. Top models performed well whether defenses were minimal or stronger.
- Security depends on task type. A model may be secure against direct instruction attacks but weaker against indirect ones (like hidden instructions in a file). That means choosing a backbone model should depend on the agent’s specific use case.
- Newer and pricier models tended to be more secure, but the differences weren’t huge. Improvements exist, but they’re not dramatic across the board.
Overall, they evaluated 31 models, showing clear differences and offering a way to compare them on meaningful, real-world security risks.
What does this mean for the future?
The paper’s impact is practical:
- It gives developers a clear, testable way to compare LLMs on security—just like people already compare models for accuracy or speed. This makes security a “first-class” metric.
- Threat snapshots help security teams and builders focus on the exact moments where things can go wrong, making red-teaming (testing for attacks) more efficient.
- The benchmark, dataset, and code are released publicly, encouraging LLM providers to improve and helping builders pick safer backbones for their agents.
The authors also note limitations:
- They didn’t measure utility (how helpful the model is) or speed. Builders still need to balance security with usefulness and cost.
- They focused on the LLM inside the agent, not on everything around it. Real systems have many moving parts—databases, retrieval systems, permissions—and those can also introduce risks.
- Multi-step attacks that unfold across several actions or tools are harder to fully capture. Future work should explore how attacks spread through the whole agent system, not just the LLM call.
In short, this paper provides a simple, strong way to test how secure different LLMs are inside AI agents and shows that careful design (especially reasoning) can make a big difference.
Knowledge Gaps
Below is a focused list of what remains missing, uncertain, or unexplored in the paper, written to guide future research:
- Coverage completeness of threat snapshots
- Only 10 text-based snapshots; no formal coverage guarantees over the space of agent tasks, attack vectors, or objectives beyond the curated set.
- No evaluation of non-text modalities (e.g., image-, audio-, or PDF-embedded injections), despite growing multi-modal agent use.
- Single-language (English) focus; robustness across languages, code-switching, and locale-specific conventions is untested.
- Domain-specific agents (finance, healthcare, legal, safety-critical) are not represented; transferability to high-stakes domains is unknown.
- Long-range, multi-step, and stateful attacks are not evaluated end-to-end; composition of snapshots into realistic attack chains remains theoretical.
- Representativeness and bias in attack collection and selection
- Attackers were randomly assigned to 1 of 7 backbone LLMs; selected “top” attacks were those that averaged high performance across these 7, potentially biasing toward broadly effective attacks and missing model-specific weaknesses.
- Only 7 attacks per snapshot-level (210 total) were retained from 194k; the number required for stable, generalizable rankings is not characterized.
- Closed beta pool (947 users) may not reflect adversarial sophistication or diversity found in real-world attackers.
- Withheld strongest attacks limit full reproducibility and external validation; the gap between public and private attack sets is large and not systematically quantified.
- Scoring and judging reliability
- Heavy reliance on LLM-as-judge; no inter-judge agreement studies, calibration, or human adjudication audits to validate correctness and reduce judge-induced bias.
- In L3, using the same backbone model as both target and judge may bias scores (self-judging effects); impact on ranking is not isolated.
- Cross-snapshot score comparability is assumed; severity normalization and calibration across heterogeneous objectives is not established.
- Metric design and real-world risk mapping
- Equal weighting across snapshots and attacks ignores real-world likelihood and impact; no severity- or probability-weighted risk metrics.
- Vulnerability score aggregates success rates, not expected harm; no end-to-end consequence modeling (e.g., tool execution leading to actual data loss or RCE).
- No validated correlation between benchmark scores and real-world incident rates or losses; external validity remains unproven.
- Non-determinism and decoding controls
- Only N=5 generations per example; sensitivity to temperature, sampling strategy, and seed variance is not analyzed.
- Decoding settings may differ across providers, especially for closed systems; comparability under divergent generation defaults is uncertain.
- Confounds in open- vs closed-weight comparisons
- Closed-weight systems likely include undisclosed guardrails and safety layers; the benchmark mixes system-level defenses with base-model behavior, preventing apples-to-apples comparisons.
- No methodology to isolate or normalize the backbone’s intrinsic security from provider-side middleware and policies.
- Reasoning vs security: mechanisms and trade-offs
- Observed security gains from enabling reasoning lack causal analysis; unclear whether improvements stem from more conservative policies, longer deliberation, or other provider changes.
- Cost/latency trade-offs of reasoning (and how many reasoning tokens are required for security gains) are not quantified.
- Cases where reasoning reduces security (tiny models) are noted but not investigated; thresholds and failure modes are unknown.
- Model size and architecture effects
- “Size doesn’t correlate with security” is based on heterogeneous families; effects of architecture, pretraining data, RLHF style, and safety-tuning are not controlled.
- Scaling laws for security (if any) remain an open question.
- Temporal robustness and update drift
- No assessment of ranking stability across time as providers update models and guardrails.
- Risk of benchmark contamination if vendors train on released attacks; defenses against overfitting and strategies for continual, surprise evaluations are not specified.
- External defenses and full pipeline security
- No evaluation of standard agent defenses (e.g., retrieval sanitization, content filtering, policy splitting, tool whitelists, sandboxing); threat snapshot utility for defense benchmarking remains to be demonstrated.
- Retrieval likelihood is not modeled; the benchmark measures vulnerability conditional on an attack being present in context, not the probability that the system ingests it.
- Interaction effects between components (RAG, memory, caches, orchestrators) and cross-step propagation are unmeasured.
- Tool-calling realism and exploitation
- Tool invocations are evaluated via outputs/plans, not via actual execution environments; real exploitability (e.g., code execution, data exfiltration) and sandbox effectiveness remain untested.
- No modeling of OS-, network-, or browser-level side effects, and no measurement of post-exploitation persistence or lateral movement.
- Context management and memory behaviors
- Effects of context-window limits, truncation, memory persistence, and history caching on vulnerability are not evaluated.
- Attacks exploiting memory/state leakage or long-context overwrites are not separately analyzed.
- Multimodal and cross-channel injections
- No coverage of prompt injections delivered via images (e.g., steganography), documents with embedded objects, or OCR artifacts.
- Cross-channel attacks (email → document → browser → agent) and supply-chain vectors (plugins/tools) are not modeled.
- Data exfiltration realism
- Use of synthetic secrets or simulated sensitive content is not detailed; fidelity to real exfiltration scenarios (e.g., authentication tokens, PII) is unclear.
- Benchmark governance and reproducibility
- Withheld attack subset and opaque provider policies hamper exact replication; a plan for third-party verification or challenge phases is not provided.
- Bootstrap CIs assume resampling independence structures that may not hold (e.g., correlated attacks or judge biases); uncertainty quantification may be optimistic.
- Safety–security disentanglement
- Security vs safety distinctions blur for “content policy bypass” tasks; how much measured variance is pure security vs alignment differences is not disentangled.
- Cross-benchmark alignment
- No empirical mapping to other agent-security benchmarks (AgentDojo, InjecAgent, WASP, AgentHarm); convergent validity and complementarity remain untested.
- Guidelines for operationalization
- How developers should weight snapshot categories for specific use cases, or translate scores into deployment decisions and risk SLAs, is not specified.
- No methodology for adapting the benchmark to enterprise-specific threat models, toolchains, or compliance requirements.
- Ethical and legal considerations
- Publishing strong attacks risks dual-use; the paper removes top-tier attacks but does not provide a principled framework for safe, useful release schedules.
- Legal/compliance implications of benchmarking across proprietary systems (and how to report vulnerabilities to vendors) are not discussed.
- Open methodology questions
- How many attacks per snapshot are sufficient for stable rankings?
- How to design adaptive evaluation that remains robust to vendor pretraining on public attacks?
- Can chain-of-snapshots benchmarking quantify compounding risk across multi-turn agent workflows?
- What normalization or auditing is needed to ensure fair comparison across closed and open systems with divergent guardrails?
Practical Applications
Immediate Applications
The paper’s threat snapshot framework and b3 benchmark enable practitioners to act now across development, evaluation, and governance of LLM-powered agents. The following applications are deployable with current tools and the released benchmark/code.
- Backbone selection and procurement using vulnerability scores
- Use case: Select the backbone LLM for an agent based on b3 benchmark vulnerability scores, with emphasis on task types relevant to the agent (e.g., indirect instruction override for RAG, indirect tool invocation for tool-heavy agents).
- Sectors: software platforms, finance, healthcare, customer support, education, legal, government.
- Tools/workflows: RFP templates that require V(m, T) thresholds per task type; model cards augmented with per-slice security scores; decision matrices balancing utility benchmarks with security slices.
- Assumptions/dependencies: Access to benchmark code/data; scores reflect your agent’s context; closed-weight systems include provider guardrails (not apples-to-apples with open weights).
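As a concrete illustration of the procurement workflow above, the following sketch filters candidate backbones by per-task-type vulnerability thresholds. The model names, slice scores, and threshold values are hypothetical placeholders, not published b3 results.

```python
# Hypothetical per-task-type vulnerability scores (lower is better); the
# model names, slice labels, numbers, and thresholds are placeholders.
CANDIDATE_MODELS = {
    "model-a": {"IIO": 0.12, "ITI": 0.08, "DTI": 0.05},
    "model-b": {"IIO": 0.22, "ITI": 0.03, "DTI": 0.04},
}

# Thresholds derived from the agent's threat model: a RAG-heavy, tool-using
# agent cares most about indirect instruction override and tool invocation.
MAX_VULNERABILITY = {"IIO": 0.15, "ITI": 0.10}

def acceptable_backbones(candidates, thresholds):
    """Return models whose score stays at or below every required threshold."""
    accepted = []
    for model, slices in candidates.items():
        if all(slices.get(task, 1.0) <= limit for task, limit in thresholds.items()):
            accepted.append(model)
    return accepted

print(acceptable_backbones(CANDIDATE_MODELS, MAX_VULNERABILITY))  # ['model-a']
```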
- CI/CD security regression testing for agent releases
- Use case: Add a “Security Gate” stage that runs threat snapshots matching your agent states whenever you change prompts, tools, retrieval chains, system prompts, or model versions.
- Sectors: software, MLOps platforms, enterprise IT.
- Tools/workflows: GitHub Actions/GitLab CI stage invoking threat snapshot evaluations; per-release diff of vulnerability scores; fail-build on regression beyond tolerance; model upgrade prechecks.
- Assumptions/dependencies: Compute budget for repeated runs to account for LLM non-determinism; ability to reconstruct agent states into model contexts.
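One way such a security gate could look in practice is sketched below; the baseline/current score files, tolerance value, and JSON layout are assumptions for illustration, not part of the released tooling.

```python
import json
import sys

REGRESSION_TOLERANCE = 0.02  # maximum allowed increase in vulnerability score

def security_gate(baseline_path, current_path):
    """Fail the build if any snapshot's vulnerability score regressed
    beyond tolerance relative to the stored baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"snapshot_id": 0.07, ...}
    with open(current_path) as f:
        current = json.load(f)

    regressions = {
        snap: (baseline.get(snap, 0.0), score)
        for snap, score in current.items()
        if score > baseline.get(snap, 0.0) + REGRESSION_TOLERANCE
    }
    if regressions:
        for snap, (old, new) in regressions.items():
            print(f"SECURITY REGRESSION {snap}: {old:.3f} -> {new:.3f}")
        sys.exit(1)  # non-zero exit fails the CI job
    print("Security gate passed.")

if __name__ == "__main__":
    security_gate("baseline_scores.json", "current_scores.json")
```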
- Agent threat modeling with snapshots (complement to STRIDE/LINDDUN)
- Use case: Extend existing threat modeling to enumerate vector–objective pairs and map them to concrete agent states, then instantiate snapshots for systematic red-teaming.
- Sectors: all sectors deploying agents (especially regulated domains).
- Tools/workflows: “Snapshot Builder” templates that capture context, insertion points, scoring; checklists covering DIO/IIO/DTI/ITI/DCE/DAIS; mapping to NIST AI RMF risk functions.
- Assumptions/dependencies: Engineer time to codify states; clear system/assistant prompts and tool definitions; representative data in context.
- Reasoning-mode policy for security-critical flows
- Use case: Enable reasoning mode for agent steps where the benchmark shows security gains (e.g., at tool-invocation or safety-gating steps), while avoiding it on tiny models where it may not help.
- Sectors: finance (trading/compliance agents), healthcare (EHR assistants), enterprise automations.
- Tools/workflows: Per-step policy in agent orchestrator: reasoning_on for high-risk steps; telemetry to monitor token/cost tradeoffs.
- Assumptions/dependencies: Reasoning improves security for most models but not “tiny” variants; cost/latency budgets.
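A minimal sketch of such a per-step policy is shown below; the step names, model identifiers, and the 'reasoning' request flag are hypothetical and will differ across providers.

```python
# Hypothetical orchestrator policy: turn reasoning on for high-risk steps
# and leave it off for very small models, where the benchmark did not
# observe security gains. Step names and model identifiers are examples.
HIGH_RISK_STEPS = {"tool_invocation", "safety_gate", "external_content_summary"}
TINY_MODELS = {"tiny-model-1b"}

def reasoning_enabled(step_type: str, model_name: str) -> bool:
    """Decide whether reasoning mode should be on for this agent step."""
    if model_name in TINY_MODELS:
        return False  # reasoning gains were not observed for tiny models
    return step_type in HIGH_RISK_STEPS

def build_request(step_type: str, model_name: str, messages: list) -> dict:
    # The 'reasoning' flag is illustrative; real provider APIs expose this
    # differently (and it affects latency and token cost).
    return {
        "model": model_name,
        "messages": messages,
        "reasoning": reasoning_enabled(step_type, model_name),
    }
```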
- Security-aware model routing
- Use case: Route tool-calling tasks to models with strong DTI/ITI performance; route content moderation/jailbreak-prone tasks to models with strong content-policy robustness slices.
- Sectors: LLM gateways, platform providers, contact centers.
- Tools/workflows: Router that picks the model by task type security slice; AB tests comparing utility vs. security tradeoffs.
- Assumptions/dependencies: Stable benchmark slices across models; routing overhead and cost.
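The router described above could be approximated as follows; the routing table, slice scores, and cost figures are invented placeholders.

```python
# Hypothetical routing table: per-slice vulnerability scores (lower is
# better) and relative costs; all names and numbers are placeholders.
MODELS = {
    "model-a": {"cost": 3.0, "slices": {"DTI": 0.05, "ITI": 0.07, "DIO": 0.10}},
    "model-b": {"cost": 1.0, "slices": {"DTI": 0.09, "ITI": 0.04, "DIO": 0.20}},
}

def route(task_type: str, max_cost: float) -> str:
    """Pick the model with the lowest vulnerability on the matching slice,
    among models within the cost ceiling (ties broken by cost)."""
    candidates = [
        (info["slices"].get(task_type, 1.0), info["cost"], name)
        for name, info in MODELS.items()
        if info["cost"] <= max_cost
    ]
    if not candidates:
        raise ValueError("no model satisfies the cost constraint")
    candidates.sort()  # lowest vulnerability first, then lowest cost
    return candidates[0][2]

print(route("ITI", max_cost=2.0))  # tool-calling step fed by untrusted content
```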
- Targeted red teaming and gamified attack collection
- Use case: Run internal “Agent Breaker” events to collect organization-specific attacks on snapshots that mirror your agents; update defenses and alerts with high-efficacy samples.
- Sectors: model providers, enterprises, cloud vendors.
- Tools/workflows: Copy the Gandalf-style challenge flow; leaderboard; feedback loops to prompt hardening; judge-based scoring.
- Assumptions/dependencies: Responsible handling of potent attacks; avoid overfitting to a fixed attack set.
- Prompt and tool defenses aligned to snapshots
- Use case: Apply snapshot-specific mitigations: stronger system prompts (L2), per-tool allowlists, output schema validators, out-of-band confirmations for sensitive actions, tool argument sanitization.
- Sectors: all agent builders; especially finance/healthcare/government.
- Tools/workflows: Structured output validators; capability-based tool permissions; input sanitation for retrieved context; LLM-as-judge gating (L3) for high-risk outputs.
- Assumptions/dependencies: LLM-as-judge can be error-prone; use a different judge model or ensemble when feasible; ensure human-in-the-loop for critical actions.
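A simplified version of these defenses, combining a tool allowlist, structured-output validation, and optional judge gating, might look like the sketch below; the tool names and JSON schema are assumptions.

```python
import json

# Per-step capability lists; tool names are examples only.
ALLOWED_TOOLS = {"search_docs", "summarize"}
SENSITIVE_TOOLS = {"send_email", "execute_code"}

def validate_tool_call(raw_output: str, judge=None):
    """Validate a model-produced tool call before executing it.

    Enforces structured output (JSON), a tool allowlist, and an optional
    LLM-as-judge check for sensitive actions. 'judge' is a placeholder
    callable that returns True when it considers the call safe.
    """
    try:
        call = json.loads(raw_output)          # structured-output check
        tool, args = call["tool"], call["arguments"]
    except (json.JSONDecodeError, KeyError) as exc:
        raise ValueError(f"malformed tool call: {exc}")

    if tool not in ALLOWED_TOOLS | SENSITIVE_TOOLS:
        raise PermissionError(f"tool '{tool}' is not allowlisted")

    if tool in SENSITIVE_TOOLS and (judge is None or not judge(tool, args)):
        raise PermissionError(f"sensitive tool '{tool}' requires judge approval")

    return tool, args
```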
- Security observability for agents (SOC integration)
- Use case: Monitor attack types (DIO/IIO/DTI/ITI/DCE/DAIS) in production; create incident runbooks keyed to snapshot categories; alert on anomalous tool invocations or context-extraction patterns.
- Sectors: enterprise security, MSSPs, platform providers.
- Tools/workflows: Per-step logs of context/tool calls; detectors keyed to snapshot-defined indicators; dashboards tracking attack prevalence and success rates.
- Assumptions/dependencies: Privacy-aware logging; robust labeling of tool calls and outputs.
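As one example of a snapshot-informed detector, the sketch below flags unexpected or bursty tool invocations in per-step logs; the expected-tool mapping and burst threshold are illustrative assumptions.

```python
from collections import Counter

# Illustrative detector: flag tool calls a step was never expected to make,
# plus bursts of repeated calls. The mapping and threshold are examples.
EXPECTED_TOOLS_BY_STEP = {"triage_email": {"search_docs"}}
BURST_THRESHOLD = 10

def check_step_log(step_name: str, tool_calls: list) -> list:
    """Return alert strings for one agent step's tool-call log."""
    alerts = []
    expected = EXPECTED_TOOLS_BY_STEP.get(step_name, set())

    unexpected = sorted({t for t in tool_calls if t not in expected})
    if unexpected:
        alerts.append(f"{step_name}: unexpected tools {unexpected}")

    for tool, count in Counter(tool_calls).items():
        if count > BURST_THRESHOLD:
            alerts.append(f"{step_name}: burst of {count} calls to '{tool}'")
    return alerts

print(check_step_log("triage_email", ["search_docs", "send_email"]))
```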
- Policy and compliance today (risk management and procurement)
- Use case: Require b3 benchmark results in supplier declarations; tie model upgrades to documented security regression tests; document per-slice risks in AI risk registers.
- Sectors: government, finance, healthcare, critical infrastructure.
- Tools/workflows: Procurement checklists; risk committee reviews using snapshot-based scores; periodic re-evaluation schedule.
- Assumptions/dependencies: Availability of independent or internal test runs; mapping to existing frameworks (e.g., NIST AI RMF).
- Training and education
- Use case: Use snapshots as hands-on labs to upskill engineers on LLM security failure modes; teach distinctions between safety (toxicity) and security (exploitation).
- Sectors: academia, enterprise L&D, bootcamps.
- Tools/workflows: Workshop kits with sample snapshots and attacks; evaluation sandboxes.
- Assumptions/dependencies: Curated datasets and safe environments.
- SMB and daily-life automations
- Use case: Before deploying personal/business automations (e.g., email triage, spreadsheet agents, Zapier-based bots), test them with relevant snapshots (especially IIO/ITI) to catch prompt injections from external content.
- Sectors: small businesses, prosumers.
- Tools/workflows: Lightweight snapshot templates for common automations; pick default assistants with stronger relevant slices.
- Assumptions/dependencies: Simplified tooling; awareness of risks from untrusted content.
Long-Term Applications
The framework points toward standards, products, and research directions that require further development, scaling, or coordination.
- Agent security certification and labeling
- Use case: Third-party labs certify agents/models against a standardized suite of threat snapshots with sector-specific thresholds; publish “Agent Security Score” labels.
- Sectors: regulators, insurers, marketplaces, app stores.
- Dependencies: Consensus on snapshot suites; accreditation bodies; anti-overfitting governance for test sets.
- Insurance underwriting and risk transfer
- Use case: Cyber insurers incorporate vulnerability scores and task-type profiles into underwriting for agent-powered services.
- Sectors: insurance, finance, enterprise IT.
- Dependencies: Historical loss data linking snapshot weaknesses to incidents; standardized reporting.
- Robustness-oriented model training
- Use case: Pretraining/fine-tuning regimes that explicitly optimize for snapshot robustness (e.g., adversarial training on IIO/ITI/DAIS), with curriculum by task type.
- Sectors: model providers, open-source communities.
- Dependencies: Scalable, diverse attack generation; avoiding catastrophic utility regressions.
- Automated attack generation and continual benchmarking
- Use case: LLM-based fuzzers that synthesize high-efficacy attacks per snapshot, continuously updating a moving-target benchmark and reducing reliance on manual red teaming.
- Sectors: research, platform security, cloud providers.
- Dependencies: Safe handling and governance; robust evaluation harnesses; defense–attack co-evolution management.
- Secure tool-calling protocols and “proof-carrying” actions
- Use case: Typed, capability-based tool APIs with policy proofs attached to each invocation; sandboxed execution with pre/post-conditions validated by independent judges.
- Sectors: software, robotics, finance, healthcare.
- Dependencies: Standardized action schemas; verifiable policy languages; integration with TEEs or container sandboxes.
- Multi-step and multi-agent attack composition analysis
- Use case: Formal methods and causal analyses to chain snapshots into end-to-end proofs of exploitability (e.g., Crescendo/multi-agent attacks), enabling holistic guarantees.
- Sectors: academia, security research, critical infrastructure.
- Dependencies: Formal models of agent state transitions; compositional verification tools.
- Cross-modal and embodied-agent extensions
- Use case: Threat snapshots for vision, audio, and sensor streams; robotics agents interacting with physical tools.
- Sectors: autonomous vehicles, manufacturing, assistive robotics, media.
- Dependencies: Modality-specific scoring and insertion functions; simulation environments.
- Security-aware agent OS/runtime
- Use case: Agent runtimes that isolate per-step contexts, enforce least-privilege on tools, embed snapshot-based risk scoring, and pause/route/escalate when risk spikes.
- Sectors: platform engineering, cloud providers.
- Dependencies: New orchestration abstractions; performance engineering; developer adoption.
- Marketplaces and routers optimizing for security slices
- Use case: Model marketplaces surface per-slice security metrics; routers automatically choose models that minimize risk for each agent step subject to cost/latency constraints.
- Sectors: LLM gateways, app stores, integrators.
- Dependencies: Standardized metrics; trust in third-party evaluations; economic incentives.
- Regulatory baselines and audits
- Use case: Regulators require periodic snapshot-based evaluations, reporting per attack category, and incident disclosure tied to task types.
- Sectors: government, regulated industries.
- Dependencies: Harmonization with NIST/ISO; auditor capacity; protection against test gaming.
- Consumer and enterprise assistants with self-audit
- Use case: Agents that automatically run judge checks or secondary models on sensitive steps (e.g., data exfiltration risks, tool invocations), with user-facing explanations and escalation.
- Sectors: productivity software, SaaS automation, customer support.
- Dependencies: Reliable judge models; UX for delays and escalations; privacy constraints.
- Sector-specific test suites and controls
- Use case: Curated snapshot packs for healthcare (EHR retrieval and PHI exfiltration), finance (trading/compliance tool use), legal (document injections), energy/ICS (control API invocations).
- Sectors: healthcare, finance, legal, energy, public sector.
- Dependencies: Domain data and simulators; specialized scoring functions; regulatory alignment.
- Dataset stewardship and overfitting controls
- Use case: Tiered release of strong attacks through evaluation APIs; rotating hidden test sets; governance to deter benchmark gaming.
- Sectors: research consortia, standards bodies.
- Dependencies: Funding and governance; community norms; legal/ethical policies.
- Foundational research directions
- Use case: Explain why reasoning tends to improve security; characterize when size does/doesn’t help; build causal models of snapshot susceptibility; new metrics beyond average scores.
- Sectors: academia, industrial research labs.
- Dependencies: Access to compute, datasets, and cross-institution collaboration.
Notes on feasibility and constraints across applications:
- Scope: The benchmark evaluates backbone LLM susceptibility at single steps; it does not model the likelihood of retrieving/inserting malicious content or downstream tool impacts. End-to-end risk requires system-level analysis.
- Distribution shift: Your agent’s contexts and tools may differ from the benchmark; adapt snapshots accordingly.
- LLM-as-judge reliability: Judge models can be brittle; consider diverse judges or human review for critical actions.
- Modalities: Current benchmark focuses on text; cross-modal expansion is future work.
- Closed vs open weights: Closed systems often include extra guardrails; scores reflect system-level security, while open-weight evaluations often reflect base-model behavior.
Glossary
- Attack insertion: The procedure that embeds a malicious payload into the model’s input context to create a poisoned variant for evaluation. "Attack insertion: A function that takes an attack and the context and outputs the poisoned context."
- Attack scoring: A scoring function that determines how successfully the model output fulfills the attacker’s objective. "Attack scoring: A function that takes the model output from the poisoned context and provides a score for how well the attack achieves its objective, that is, how close the output is to the intended output of the attacker."
- Attack vector: The delivery mechanism by which an attack reaches the model (e.g., direct input or via poisoned documents). "This categorization distinguishes attacks by their delivery method (attack vector) and their goal (attack objective)."
- b3 benchmark: An evaluation suite for agent security that pairs threat snapshots with high-quality crowdsourced attacks. "the b3 benchmark, a security benchmark based on 194331 unique crowdsourced adversarial attacks."
- Backbone LLM: The primary LLM that powers an agent and determines its behavior and security properties. "we aim to systematically understand how the choice of the backbone LLM in an AI agent affects its security."
- Black-box: A system whose internal workings are not visible, making outputs difficult to predict or reason about. "non-deterministic black-box outputs from the backbone LLMs"
- Crowdsourcing: Collecting data or solutions from large numbers of contributors, often via a public or gamified platform. "We therefore gather high-quality adapted attacks through large-scale crowdsourcing using a gamified red teaming challenge built around the threat snapshots."
- Cross-site scripting: A web security vulnerability where malicious scripts are injected into trusted content, here noted as a traditional risk relevant to agents. "permission mismanagement or cross-site scripting"
- Data exfiltration: Unauthorized extraction of sensitive information from the model’s context or system. "We divide attack objectives into six main categories: data exfiltration, content injection, decision and behavior manipulation, denial-of-service, system and tool compromise and content policy bypass."
- Denial of AI service (DAIS): An attack type aimed at preventing the model from providing its intended service. "We consider six categories: direct instruction override (DIO), indirect instruction override (IIO), direct tool invocation (DTI), indirect tool invocation (ITI), direct context extraction (DCE) and denial of AI service (DAIS)."
- Denial-of-service: An attack objective focused on disrupting or halting the agent’s operation. "We divide attack objectives into six main categories: data exfiltration, content injection, decision and behavior manipulation, denial-of-service, system and tool compromise and content policy bypass."
- Direct context extraction (DCE): A task-type attack where the model is induced to reveal information directly from its context. "We consider six categories: direct instruction override (DIO), indirect instruction override (IIO), direct tool invocation (DTI), indirect tool invocation (ITI), direct context extraction (DCE) and denial of AI service (DAIS)."
- Direct instruction override (DIO): A task-type attack where the attacker’s direct inputs override the model’s intended instructions. "We consider six categories: direct instruction override (DIO), indirect instruction override (IIO), direct tool invocation (DTI), indirect tool invocation (ITI), direct context extraction (DCE) and denial of AI service (DAIS)."
- Direct tool invocation (DTI): A task-type attack that forces the model to call tools directly in service of the attack objectives. "We consider six categories: direct instruction override (DIO), indirect instruction override (IIO), direct tool invocation (DTI), indirect tool invocation (ITI), direct context extraction (DCE) and denial of AI service (DAIS)."
- Guardrails: Protective pre- and post-processing layers or policies applied around the model to mitigate harmful behavior. "e.g., guardrails deployed by model providers."
- Indirect injections: Attack techniques that embed malicious instructions within content the model consumes (e.g., documents or web pages). "by only considering indirect injections, remote code execution or other more restricted attack vectors."
- Indirect instruction override (IIO): A task-type attack where instructions hidden in consumed content cause the model to deviate from its intended behavior. "We consider six categories: direct instruction override (DIO), indirect instruction override (IIO), direct tool invocation (DTI), indirect tool invocation (ITI), direct context extraction (DCE) and denial of AI service (DAIS)."
- Indirect tool invocation (ITI): A task-type attack that induces the model to call tools based on instructions embedded indirectly in its inputs. "We consider six categories: direct instruction override (DIO), indirect instruction override (IIO), direct tool invocation (DTI), indirect tool invocation (ITI), direct context extraction (DCE) and denial of AI service (DAIS)."
- LLM-as-judge: A defense setup where an LLM evaluates another LLM’s outputs for compliance or safety. "(iii) a level denoted by L3 that adds an LLM-as-judge defense, using the same backbone as judge, to L2."
- Model context: The full set of inputs (e.g., prompts, messages, tool definitions) provided to the model at a specific step. "State model context: The full (non-poisoned) model context that would be passed to the backbone LLM at time t."
- Non-deterministic: Behavior where repeated runs can produce different outputs due to stochasticity or sampling. "non-deterministic sequential nature of AI agents"
- Non-parametric bootstrap: A resampling method used to estimate uncertainty (e.g., confidence intervals) without assuming a parametric form. "We further propose to estimate the uncertainty in the vulnerability score using a non-parametric bootstrap"
- Poisoned context: An input context that has been altered by an attack to change the model’s behavior. "outputs the poisoned context."
- Red teaming: Expert-driven adversarial testing to discover vulnerabilities through crafted attacks. "the gold-standard remains manual red teaming, which does not scale."
- Remote code execution: A severe vulnerability where an attacker gains the ability to run code on a target system. "by only considering indirect injections, remote code execution or other more restricted attack vectors."
- Retrieval-augmented generation (RAG): A technique where external documents are retrieved and fed into the model to inform generation; security depends on retrieval integrity. "from a given RAG implementation"
- Statelessness: Treating the model as not retaining internal state across steps, relying solely on the provided context each call. "This statelessness is conceptual and does not restrict generality, as any model that maintains state through techniques like history caching can be modified to accept the full context on each call without changing its behavior."
- Structured outputs: Model responses constrained to specific formats (e.g., JSON), which can interact with security properties. "including tool-calling and structured outputs"
- Threat snapshot: A formalized instance capturing a specific agent state, attack delivery, and evaluation method for an LLM vulnerability. "we introduce threat snapshots: a framework that isolates specific states in an agent's execution flow where LLM vulnerabilities manifest, enabling the systematic identification and categorization of security risks that propagate from the LLM to the agent level."
- Tool-calling: The capability of an LLM to invoke external tools or APIs during task execution. "including tool-calling and structured outputs"
- Tool definitions: Descriptions of available tools and their interfaces provided in the model context to enable tool use. "and tool definitions and returns a (model) output consisting of either a text response, a tool call or both."
- Vector-objective categorization: A taxonomy that classifies attacks by delivery method (vector) and intended goal (objective). "We propose two complementary categorizations: a vector-objective categorization based on attack vector and objective that facilitates threat modeling of individual AI agents"
- Vulnerability score: An aggregate metric quantifying how susceptible a model is to a set of threat snapshots and attacks. "we define the vulnerability score V(m, T) for LLM m on T by"