ATBench: Agent Trajectory Safety Benchmark

Updated 30 January 2026
  • ATBench is a benchmark suite that evaluates AI agent safety via trajectory analysis using multi-turn, tool-augmented scenarios and a three-axis safety taxonomy.
  • It synthesizes 500 unique trajectories with balanced safe/unsafe scenarios, leveraging over 1,500 distinct tool definitions to test model generalization.
  • The benchmark enables fine-grained risk diagnosis by categorizing unsafe behaviors across risk sources, failure modes, and real-world harms with rigorous quality control.

ATBench (Agent Trajectory Safety and Security Benchmark) is a trajectory-level evaluation suite created to systematically measure both binary safety and fine-grained risk-diagnosis capabilities of AI agents in open, tool-augmented environments. Its distinctiveness lies in multi-turn, tool-centric scenarios and a principled safety taxonomy, enabling researchers to attribute unsafe behaviors precisely and stress-test agent models for transparency, real-world applicability, and generalization to novel tools (Liu et al., 26 Jan 2026).

1. Conceptual Foundations and Three-Dimensional Taxonomy

ATBench is anchored in a unified, orthogonal taxonomy for agentic risk with three independent axes:

  • Risk Source (Where): Eight categories sourced from four classes—User Input (Malicious User Instruction/Jailbreak, Direct Prompt Injection), Environmental Observation (Indirect Prompt Injection, Unreliable/Misinformation), External Entities (Tool Description Injection, Malicious Tool Execution, Corrupted Tool Feedback), and Internal Logic (Inherent Agent/LLM Failures).
  • Failure Mode (How): Fourteen subcategories split between Behavioral Failure Modes (e.g., Unconfirmed/Over-privileged Action, Flawed Planning/Reasoning, Improper Tool Use, Insecure Execution, Procedural Deviation/Inaction, Inefficient Execution) and Output Content Failure Modes (e.g., Harmful/Offensive Generation, Instruction for Illegal Activity, Malicious Executables, Unauthorized Disclosure, Inaccurate Content).
  • Real-World Harm (What): Ten impact categories spanning Privacy & Confidentiality, Economic, System Integrity, Physical/Health, Psychological/Emotional, Interpersonal/Reputational, Societal/Info-ecosystem, Public Service/Resource, Fairness/Equity, Functional/Opportunity.

Because these axes are orthogonal, each unsafe trajectory receives a unique triplet label, forming an $8 \times 14 \times 10$ combinatorial risk space.
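The triplet labeling scheme above can be sketched as a small data structure. This is an illustrative sketch, not the authors' code; the Risk Source names are taken from the list above, while the class and constant names are assumptions:

```python
from dataclasses import dataclass

# The "Where" axis: eight Risk Source categories from the taxonomy above.
RISK_SOURCES = [
    "Malicious User Instruction/Jailbreak", "Direct Prompt Injection",
    "Indirect Prompt Injection", "Unreliable/Misinformation",
    "Tool Description Injection", "Malicious Tool Execution",
    "Corrupted Tool Feedback", "Inherent Agent/LLM Failures",
]
N_FAILURE_MODES = 14  # "How" axis
N_HARM_TYPES = 10     # "What" axis

@dataclass(frozen=True)
class RiskTriplet:
    """Unique label assigned to each unsafe trajectory."""
    risk_source: str   # Where the risk originates (8 options)
    failure_mode: str  # How the agent fails (14 options)
    harm: str          # What real-world impact results (10 options)

# Size of the combinatorial risk space:
print(len(RISK_SOURCES) * N_FAILURE_MODES * N_HARM_TYPES)  # → 1120
```

Orthogonality of the axes is what makes every combination a distinct, attributable risk profile.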

2. Dataset Synthesis and Curation Protocol

ATBench comprises 500 unique trajectories (250 safe, 250 unsafe), averaging 8.97 agent–environment exchanges each and drawing on 1,575 distinct tool definitions. The construction pipeline consists of three stages:

  • Planning: Sampling a tuple (Risk Source, Failure Mode, Harm Type), designating the trace safe or unsafe, and selecting from a library of more than 10,000 tools. A planner writes a free-form chain-of-thought embedding the risk, followed by a structured execution plan (user query, tool calls, injection points, mitigation path for safe traces).
  • Trajectory Synthesis: The Orchestrator alternates user queries, simulated tool responses (including malicious/corrupted outputs for unsafe traces), and LLM agent outputs that either act unsafely or mitigate accordingly.
  • Summarization and Quality Control: Each trajectory is finalized with a concise log of attack/mitigation, then filtered by deterministic validators and a specialized LLM-judge for semantic alignment to the sampled taxonomy labels. ~52% of candidates survive QC.
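The three-stage pipeline can be sketched in outline as below. This is a hypothetical skeleton under stated assumptions: the function names, signatures, and the placeholder bodies are mine, and the real stages wrap LLM calls (planner, orchestrator, judge) that are elided here:

```python
import random

def plan(taxonomy: list[list[str]], tool_library: list[str], safe: bool) -> dict:
    """Stage 1: sample a (source, mode, harm) tuple and draft an execution plan."""
    triplet = tuple(random.choice(axis) for axis in taxonomy)
    tools = random.sample(tool_library, k=min(3, len(tool_library)))
    # A real planner would also emit the chain-of-thought and structured plan.
    return {"triplet": triplet, "safe": safe, "tools": tools}

def synthesize(execution_plan: dict) -> dict:
    """Stage 2: orchestrator alternates user queries, simulated tool responses
    (including malicious/corrupted outputs for unsafe traces), and agent turns."""
    return {"plan": execution_plan, "turns": []}  # placeholder for orchestrated turns

def passes_qc(trajectory: dict) -> bool:
    """Stage 3: deterministic validators plus an LLM-judge checking semantic
    alignment to the sampled taxonomy labels (~52% of candidates survive)."""
    return True  # placeholder for validator and judge checks
```

The point of the sketch is the control flow: candidates that fail either the deterministic validators or the judge are discarded, which is why only about half survive.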

Trajectories are selected under strict tool-split conditions: none of the 2,292 tool definitions in ATBench exist in the training pool, ensuring assessment of genuine generalization rather than memorization.
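A minimal sketch of enforcing this split, assuming tools are identified by name (the helper is hypothetical, not from the paper):

```python
def check_unseen_tool_split(benchmark_tools: set[str], training_tools: set[str]) -> None:
    """Raise if any benchmark tool definition also appears in the training pool."""
    overlap = benchmark_tools & training_tools
    if overlap:
        raise ValueError(f"tool leakage into training pool: {sorted(overlap)}")

# Disjoint sets pass silently; any overlap is flagged before evaluation.
check_unseen_tool_split({"charging_station_api", "send_webhook"}, {"web_search"})
```

A disjoint split of this kind is what licenses the claim that high scores reflect generalization to novel tools rather than memorization.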

3. Annotation, Data Statistics, and Coverage

A panel of four large models (Qwen-QwQ, GPT-5.2, Gemini 3 Pro, DeepSeek-V3.2) annotates each trajectory, producing binary safe/unsafe verdicts and triplet taxonomy labels. Binary labels are aggregated by majority vote; non-unanimous cases (227/500 “Hard”) are referred to human experts in double-blind adjudication. “Easy” (unanimous) cases undergo spot-checks, confirming high consensus.
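The aggregation rule described above can be written down directly. A sketch of the assumed logic (a 2–2 tie is by construction non-unanimous, so it is routed to human adjudication along with the other "Hard" cases):

```python
from collections import Counter

def aggregate(verdicts: list[str]) -> tuple[str, str]:
    """Majority-vote a panel's binary verdicts and flag non-unanimous cases.

    Returns (majority_label, "Easy" | "Hard"); "Hard" cases go to
    double-blind human adjudication, "Easy" cases get spot-checks.
    """
    counts = Counter(verdicts)
    label, _ = counts.most_common(1)[0]
    difficulty = "Easy" if len(counts) == 1 else "Hard"
    return label, difficulty

print(aggregate(["unsafe", "unsafe", "unsafe", "unsafe"]))  # ('unsafe', 'Easy')
print(aggregate(["unsafe", "unsafe", "safe", "unsafe"]))    # ('unsafe', 'Hard')
```

With four annotators, 227 of the 500 trajectories fell into the "Hard" (non-unanimous) bucket.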

Key dataset statistics:

| Property | Value | Notes |
|---|---|---|
| Total trajectories | 500 | 250 safe / 250 unsafe |
| Avg. turns/trajectory | 8.97 | Includes agent/env. messages |
| Unique tools invoked | 1,575 | None used in training pool |
| Risk Source coverage | All 8 categories, ~uniform | Fig. 4 in paper |
| Failure Mode coverage | All 14 categories | |
| Harm coverage | All 10 categories | Includes high-impact, long-tail |

4. Evaluation Protocol and Metrics

ATBench quantitatively evaluates two dimensions:

  • (a) Trajectory-level Safety Evaluation: For each trajectory $T$, models predict $\hat{y} \in \{\text{safe}, \text{unsafe}\}$. Standard binary classification metrics are used:
    • Accuracy: $\frac{TP + TN}{TP + FP + TN + FN}$
    • Precision: $\frac{TP}{TP + FP}$
    • Recall: $\frac{TP}{TP + FN}$ (risk-detection rate)
    • F1: $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
  • (b) Fine-Grained Risk Diagnosis: For unsafe trajectories, models must attribute the correct Risk Source, Failure Mode, and Harm category. Per-axis classification accuracy is reported:
    • Risk Source Accuracy: $\frac{\#\,\text{correct}}{\#\,\text{unsafe}}$
    • Failure Mode Accuracy: $\frac{\#\,\text{correct}}{\#\,\text{unsafe}}$
    • Harm Accuracy: $\frac{\#\,\text{correct}}{\#\,\text{unsafe}}$

No custom risk-score equations are employed beyond these standard metrics.
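Since only standard metrics are involved, the full protocol reduces to a few lines. A sketch computing the metrics above from confusion-matrix counts (treating "unsafe" as the positive class, which the recall-as-risk-detection-rate framing implies):

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Trajectory-level safety metrics; positive class = unsafe."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # risk-detection rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def per_axis_accuracy(n_correct: int, n_unsafe: int) -> float:
    """Fine-grained diagnosis: fraction of unsafe trajectories whose
    Risk Source / Failure Mode / Harm label was attributed correctly."""
    return n_correct / n_unsafe

# Example with hypothetical counts (not from the paper):
print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

The same `per_axis_accuracy` function is applied three times, once per taxonomy axis, over the unsafe subset only.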

5. Illustrative Scenarios

Three cases typify the ATBench structure:

  • Prompt-Injection Safe Case: An indirect prompt injection in a tool's "notes" field instructs the agent to bypass authorization. The agent detects the injection and refuses, so the trajectory is labeled "Safe."
  • Prompt-Injection Unsafe Case: A "SYSTEM OVERRIDE" embedded in a screenshot causes goal drift; the agent sends an unauthorized WhatsApp message. The trajectory is labeled "Unsafe" with the triplet (Indirect Prompt Injection, Tool Misuse in Context, Functional & Opportunity Harm).
  • Unseen-Tool Corruption Case: A charging-station API returns corrupted instructions; the agent fails to validate the feedback and creates an unauthorized webhook, yielding (Corrupted Tool Feedback, Instruction for Harmful/Illegal Activity, Public Service & Resource Harm).

6. Baseline Performance and Comparative Results

ATBench is used to evaluate models such as AgentDoG variants, general LLMs, and specialized guard systems (R-Judge, ASSE-Safety, ShieldAgent, PolyGuard):

| Model | Accuracy | Precision | Recall | F1 | Risk Src Acc | Fail Mode Acc | Harm Acc |
|---|---|---|---|---|---|---|---|
| AgentDoG-Qwen3-4B | 92.8% | 90.5% | 95.6% | 93.0% | 82.0% | 32.4% | 58.4% |
| ShieldAgent | 70.9% | – | – | – | – | – | – |
| PolyGuard | 77.0% | – | – | – | – | – | – |
| Gemini 3 Pro | 36.8% | – | – | – | – | – | – |
| GPT-5.2 | 30.8% | – | – | – | – | – | – |

AgentDoG-Qwen3-4B demonstrates state-of-the-art results, substantially outperforming competing guard models in both binary safety detection and fine-grained risk attribution.

7. Impact, Applications, and Research Directions

ATBench advances AI agent safety evaluation by addressing four main fronts:

  • Transitioning from single-turn moderation to complex, tool-augmented, multi-turn trajectories.
  • Employing a three-axis taxonomy to unify safety and security risk perspectives.
  • Enforcing an unseen-tool split to measure model generalization.
  • Requiring fine-grained risk attribution rather than binary verdicts.

Practical Applications:

  • Benchmarking diagnostic guardrails capable of root cause analysis across tool executions.
  • Supporting interpretable monitors that translate taxonomy labels into alignment or RL corrections.
  • Facilitating evaluation for multi-agent and multimodal systems with richer environmental interactions.

ATBench provides a rigorous, reproducible baseline for both safety detection and explanatory risk diagnosis. It is foundational for the development, deployment, and alignment of safer and more transparent agentic systems in real-world environments (Liu et al., 26 Jan 2026).
