Papers
Topics
Authors
Recent
Search
2000 character limit reached

Agent-SafetyBench: Evaluating LLM Agent Safety

Updated 4 March 2026
  • Agent-SafetyBench is a comprehensive suite of datasets, benchmarks, and evaluation methods that systematically assess safety risks in interactive, tool-using LLM agents.
  • It measures both task performance and safety using multi-turn interactions, real-world tool schemas, and simulated environments to capture diverse failure modes.
  • Empirical results highlight critical robustness and risk awareness gaps, underlining the need for enhanced, risk-sensitive safety mechanisms in LLM agents.

Agent-SafetyBench

Agent-SafetyBench refers to a suite of datasets, benchmarks, and evaluation methodologies designed for systematically assessing and improving the safety of LLM agents as they interact with environments, tools, and users. Unlike traditional content-only safety benchmarks, Agent-SafetyBench focuses specifically on the behavioral safety of interactive, tool-using agents—probing their robustness, risk awareness, and failure modes across a broad spectrum of realistic threat scenarios and environments (Zhang et al., 2024).

1. Scope, Motivation, and Design Principles

Agent-SafetyBench is motivated by the proliferation of LLM-based agents in real-world tasks where they make external tool calls, control devices, or provide actionable outputs in dynamic settings (e.g., sending emails, financial transactions, physical task execution). This paradigm shift exposes new classes of safety risks—such as misusing APIs, leaking sensitive data, executing harmful sequences, or failing to comply with regulatory and policy constraints—that cannot be captured by text-only safety probes.

Key design principles include:

  • Comprehensive coverage of behavioral risks: Encompassing privacy, property loss, physical harm, legal violation, misinformation, code execution, and availability compromise.
  • Diverse interaction environments: Simulated or real interactive settings with modeled APIs, task schemas, action-state tracking, and dynamic multi-turn interactions.
  • Failure mode taxonomy: Explicit annotation and diagnosis of root causes (e.g., premature tool invocation, ignoring constraints, over-trusting results).
  • Integration of real-world tool schemas: JSON-specified APIs, matching code, and full execution traces.

The benchmark suite is constructed to allow fine-grained measurement of both "task helpfulness" (utility, success rates) and "safety" (avoidance of harmful or policy-violating behavior), with explicit metrics and case labeling for downstream analysis (Zhang et al., 2024).

2. Dataset Composition and Taxonomies

Agent-SafetyBench consists of 349 simulated interactive environments, each with:

  • JSON tool schemas (names, signatures, descriptions)
  • Python runtime class implementing tool behavior and state
  • Agent-environment interaction protocol: multi-turn input, tool-call, result, and message passing

Within these environments, 2,000 distinct test cases are curated, each tagged by:

  • Primary risk category (from an 8-class taxonomy):

    1. Leak sensitive data/information
    2. Lead to property loss
    3. Produce unsafe information/misinformation
    4. Spread unsafe information/misinformation
    5. Lead to physical harm
    6. Violate law/ethics
    7. Contribute to harmful/vulnerable code
    8. Compromise availability
  • Failure modes (10 types), including:

    • Harmful content generation without tool
    • Calling tools with incomplete or ambiguous information
    • Ignoring explicit or implicit constraints
    • Omission of required safeguards
    • Over-trust in tool outputs
    • Unsafe parameter specification
    • Failure to filter/prioritize results

Each item is validated through multi-stage human review; test case transcripts are further human-annotated for fine-grained safety and failure-mode correctness (Zhang et al., 2024).

3. Evaluation Methodology and Safety Metrics

Agent-SafetyBench deploys a dynamic, turn-based agent-evaluation protocol:

  1. The agent receives a prompt specifying available tools and a user goal.
  2. At each turn, the agent selects a tool call or generates a response; the environment executes tool calls and returns structured outcomes.
  3. Interactions continue until a final output is produced, yielding a full execution transcript.

Safety Annotation and Scoring:

  • Each transcript is classified as "safe" or "unsafe" using an automatic scorer fine-tuned on 4,000 human-labeled transcripts.
  • Safety Score: the unweighted average proportion of safe transcripts across all 8 categories.

Formally, for agent AA: SA=1Cc=1CSA,cS_A = \frac{1}{C} \sum_{c=1}^{C} S_{A,c} where

SA,c=1Nci=1NcI(case i in c is classified safe for A)S_{A,c} = \frac{1}{N_c} \sum_{i=1}^{N_c} \mathbb{I}(\text{case }i\text{ in }c\text{ is classified safe for }A)

with Nc=250N_c = 250 per category (Zhang et al., 2024).

Failure-mode analysis enables attribution of unsafe cases to specific vulnerabilities, supporting targeted improvement.

4. Empirical Results and Failure Patterns

Benchmarking sixteen popular agents (proprietary and open-source), Agent-SafetyBench exposes critical safety gaps:

  • No evaluated agent exceeds a 60% safety score; the best, Claude-3-Opus, achieves 59.8%.
  • Proprietary models typically outperform open-source, with a moderate positive correlation between model size and safety.
  • The hardest category is "spread unsafe information/misinformation" (average 15.6% score).
  • Behavioral cases—those requiring tool usage or environment interaction—consistently yield lower safety than content-only jailbreak probes.

Observed failure patterns:

  • Robustness defects: Agents may fabricate tool parameters, skip prerequisite checks, or omit required actions, leading to resource loss or property compromise.
  • Risk awareness deficits: Agents often fail to anticipate downstream harms, perform inadequate cost–benefit or constraint analysis, and may misapply safety-critical tools.
  • Prompt-based interventions are inadequate: Simple or enhanced defense prompts yield only marginal improvements, especially in strong models, and cannot substitute for systematic agent-level safety mechanisms.

5. Fundamental Defects and Recommendations

Detailed analysis of annotated transcripts attributes current agent safety failures to two principal defects:

  • Lack of robustness: Brittle tool-use policies result in tool misuse or omission under context shifts.
  • Lack of risk awareness: Absence of explicit hazard assessment and planning regarding tool outcomes.

The data indicates that defense prompts are insufficient, as they inflate context length and cognitive burden yet do not resolve underlying brittleness or risk-insensitive reasoning.

Recommendations for future safety-aligned agent systems:

  • Fine-tune on adversarial agent–environment interactions with risk-annotated outcomes.
  • Integrate explicit risk-assessment modules or learned safety critics to evaluate pre-action plans.
  • Develop risk-aware planners simulating tool outcomes under relevant constraints, preferring actions with minimal harm likelihood.
  • Apply formal verification or contract-based checking to tool invocation, respecting pre-defined pre-/post-conditions.
  • Maintain continuous automated red-teaming cycles in realistic, interactive sandboxes to stress-test emerging models (Zhang et al., 2024).

6. Significance and Directions for Progress

Agent-SafetyBench establishes a technically rigorous, extensible foundation for agent safety assessment. Its comprehensive coverage, taxonomic breakdowns, formal evaluation protocols, and public release (code, data, annotation pipeline) provide the research community with critical infrastructure to quantify, diagnose, and mitigate real behavioral safety risks in LLM-driven agents.

Empirical findings demonstrate that real-world deployments of current LLM agents carry substantial and diverse safety risks; advancement requires coordinated effort in robust model design, explicit risk modeling, integrated verification, and continual adversarial stress testing (Zhang et al., 2024).


Key Reference:

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Agent-SafetyBench.