ATBench: Agent Trajectory Safety Benchmark

Updated 30 January 2026
  • ATBench is a benchmark suite that evaluates AI agent safety via trajectory analysis using multi-turn, tool-augmented scenarios and a three-axis safety taxonomy.
  • It synthesizes 500 unique trajectories with balanced safe/unsafe scenarios, leveraging over 1,500 distinct tool definitions to test model generalization.
  • The benchmark enables fine-grained risk diagnosis by categorizing unsafe behaviors across risk sources, failure modes, and real-world harms with rigorous quality control.

ATBench (Agent Trajectory Safety and Security Benchmark) is a trajectory-level evaluation suite created to systematically measure both binary safety and fine-grained risk-diagnosis capabilities of AI agents in open, tool-augmented environments. Its distinctiveness lies in multi-turn, tool-centric scenarios and a principled safety taxonomy, enabling researchers to attribute unsafe behaviors precisely and stress-test agent models for transparency, real-world applicability, and generalization to novel tools (Liu et al., 26 Jan 2026).

1. Conceptual Foundations and Three-Dimensional Taxonomy

ATBench is anchored in a unified, orthogonal taxonomy for agentic risk with three independent axes:

  • Risk Source (Where): Eight categories sourced from four classes—User Input (Malicious User Instruction/Jailbreak, Direct Prompt Injection), Environmental Observation (Indirect Prompt Injection, Unreliable/Misinformation), External Entities (Tool Description Injection, Malicious Tool Execution, Corrupted Tool Feedback), and Internal Logic (Inherent Agent/LLM Failures).
  • Failure Mode (How): Fourteen subcategories split between Behavioral Failure Modes (e.g., Unconfirmed/Over-privileged Action, Flawed Planning/Reasoning, Improper Tool Use, Insecure Execution, Procedural Deviation/Inaction, Inefficient Execution) and Output Content Failure Modes (e.g., Harmful/Offensive Generation, Instruction for Illegal Activity, Malicious Executables, Unauthorized Disclosure, Inaccurate Content).
  • Real-World Harm (What): Ten impact categories spanning Privacy & Confidentiality, Economic, System Integrity, Physical/Health, Psychological/Emotional, Interpersonal/Reputational, Societal/Info-ecosystem, Public Service/Resource, Fairness/Equity, Functional/Opportunity.

Because these axes are orthogonal, each unsafe trajectory receives a unique triplet label, forming an $8 \times 14 \times 10$ combinatorial risk space.
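The triplet labeling scheme above can be sketched as a small data structure. This is an illustrative sketch, not the authors' code; the Risk Source names are taken from the list above, while the class and constant names are assumptions:

```python
from dataclasses import dataclass

# The "Where" axis: eight Risk Source categories from the taxonomy above.
RISK_SOURCES = [
    "Malicious User Instruction/Jailbreak", "Direct Prompt Injection",
    "Indirect Prompt Injection", "Unreliable/Misinformation",
    "Tool Description Injection", "Malicious Tool Execution",
    "Corrupted Tool Feedback", "Inherent Agent/LLM Failures",
]
N_FAILURE_MODES = 14  # "How" axis
N_HARM_TYPES = 10     # "What" axis

@dataclass(frozen=True)
class RiskTriplet:
    """Unique label assigned to each unsafe trajectory."""
    risk_source: str   # Where the risk originates (8 options)
    failure_mode: str  # How the agent fails (14 options)
    harm: str          # What real-world impact results (10 options)

# Size of the combinatorial risk space:
print(len(RISK_SOURCES) * N_FAILURE_MODES * N_HARM_TYPES)  # → 1120
```

Orthogonality of the axes is what makes every combination a distinct, attributable risk profile.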

2. Dataset Synthesis and Curation Protocol

ATBench comprises 500 unique trajectories (250 safe, 250 unsafe), averaging 8.97 agent–environment exchanges each and drawing on 1,575 distinct tool definitions. The construction pipeline consists of three stages:

  • Planning: Sampling a tuple (Risk Source, Failure Mode, Harm Type), designating the trace safe or unsafe, and selecting from a library of more than 10,000 tools. A planner writes a free-form chain-of-thought embedding the risk, followed by a structured execution plan (user query, tool calls, injection points, mitigation path for safe traces).
  • Trajectory Synthesis: The Orchestrator alternates user queries, simulated tool responses (including malicious/corrupted outputs for unsafe traces), and LLM agent outputs that either act unsafely or mitigate accordingly.
  • Summarization and Quality Control: Each trajectory is finalized with a concise log of attack/mitigation, then filtered by deterministic validators and a specialized LLM-judge for semantic alignment to the sampled taxonomy labels. ~52% of candidates survive QC.
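The three-stage pipeline can be sketched in outline as below. This is a hypothetical skeleton under stated assumptions: the function names, signatures, and the placeholder bodies are mine, and the real stages wrap LLM calls (planner, orchestrator, judge) that are elided here:

```python
import random

def plan(taxonomy: list[list[str]], tool_library: list[str], safe: bool) -> dict:
    """Stage 1: sample a (source, mode, harm) tuple and draft an execution plan."""
    triplet = tuple(random.choice(axis) for axis in taxonomy)
    tools = random.sample(tool_library, k=min(3, len(tool_library)))
    # A real planner would also emit the chain-of-thought and structured plan.
    return {"triplet": triplet, "safe": safe, "tools": tools}

def synthesize(execution_plan: dict) -> dict:
    """Stage 2: orchestrator alternates user queries, simulated tool responses
    (including malicious/corrupted outputs for unsafe traces), and agent turns."""
    return {"plan": execution_plan, "turns": []}  # placeholder for orchestrated turns

def passes_qc(trajectory: dict) -> bool:
    """Stage 3: deterministic validators plus an LLM-judge checking semantic
    alignment to the sampled taxonomy labels (~52% of candidates survive)."""
    return True  # placeholder for validator and judge checks
```

The point of the sketch is the control flow: candidates that fail either the deterministic validators or the judge are discarded, which is why only about half survive.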

Trajectories are selected under strict tool-split conditions: none of the 2,292 tool definitions in ATBench exist in the training pool, ensuring assessment of genuine generalization rather than memorization.
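A minimal sketch of enforcing this split, assuming tools are identified by name (the helper is hypothetical, not from the paper):

```python
def check_unseen_tool_split(benchmark_tools: set[str], training_tools: set[str]) -> None:
    """Raise if any benchmark tool definition also appears in the training pool."""
    overlap = benchmark_tools & training_tools
    if overlap:
        raise ValueError(f"tool leakage into training pool: {sorted(overlap)}")

# Disjoint sets pass silently; any overlap is flagged before evaluation.
check_unseen_tool_split({"charging_station_api", "send_webhook"}, {"web_search"})
```

A disjoint split of this kind is what licenses the claim that high scores reflect generalization to novel tools rather than memorization.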

3. Annotation, Data Statistics, and Coverage

A panel of four large models (Qwen-QwQ, GPT-5.2, Gemini 3 Pro, DeepSeek-V3.2) annotates each trajectory, producing binary safe/unsafe verdicts and triplet taxonomy labels. Binary labels are aggregated by majority vote; non-unanimous cases (227/500 “Hard”) are referred to human experts in double-blind adjudication. “Easy” (unanimous) cases undergo spot-checks, confirming high consensus.
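The aggregation rule described above can be written down directly. A sketch of the assumed logic (a 2–2 tie is by construction non-unanimous, so it is routed to human adjudication along with the other "Hard" cases):

```python
from collections import Counter

def aggregate(verdicts: list[str]) -> tuple[str, str]:
    """Majority-vote a panel's binary verdicts and flag non-unanimous cases.

    Returns (majority_label, "Easy" | "Hard"); "Hard" cases go to
    double-blind human adjudication, "Easy" cases get spot-checks.
    """
    counts = Counter(verdicts)
    label, _ = counts.most_common(1)[0]
    difficulty = "Easy" if len(counts) == 1 else "Hard"
    return label, difficulty

print(aggregate(["unsafe", "unsafe", "unsafe", "unsafe"]))  # ('unsafe', 'Easy')
print(aggregate(["unsafe", "unsafe", "safe", "unsafe"]))    # ('unsafe', 'Hard')
```

With four annotators, 227 of the 500 trajectories fell into the "Hard" (non-unanimous) bucket.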

Key dataset statistics:

| Property | Value | Notes |
|---|---|---|
| Total trajectories | 500 | 250 safe / 250 unsafe |
| Avg. turns/trajectory | 8.97 | Includes agent/env. messages |
| Unique tools invoked | 1,575 | None used in training pool |
| Risk Source coverage | All 8 categories, ~uniform | Fig. 4 in paper |
| Failure Mode coverage | All 14 categories | |
| Harm coverage | All 10 categories | Includes high-impact, long-tail |

4. Evaluation Protocol and Metrics

ATBench quantitatively evaluates two dimensions:

  • (a) Trajectory-level Safety Evaluation: For each trajectory $T$, models predict $\hat{y} \in \{\text{safe}, \text{unsafe}\}$. Standard binary classification metrics are used:
    • Accuracy: $\frac{TP + TN}{TP + FP + TN + FN}$
    • Precision: $\frac{TP}{TP + FP}$
    • Recall: $\frac{TP}{TP + FN}$ (risk-detection rate)
    • F1: $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
  • (b) Fine-Grained Risk Diagnosis: For unsafe trajectories, models must attribute the correct Risk Source, Failure Mode, and Harm category. Per-axis classification accuracy is reported:
    • Risk Source Accuracy: $\frac{\#\,\text{correct}}{\#\,\text{unsafe}}$
    • Failure Mode Accuracy: $\frac{\#\,\text{correct}}{\#\,\text{unsafe}}$
    • Harm Accuracy: $\frac{\#\,\text{correct}}{\#\,\text{unsafe}}$

No custom risk-score equations are employed beyond these standard metrics.
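Since only standard metrics are involved, the full protocol reduces to a few lines. A sketch computing the metrics above from confusion-matrix counts (treating "unsafe" as the positive class, which the recall-as-risk-detection-rate framing implies):

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Trajectory-level safety metrics; positive class = unsafe."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # risk-detection rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def per_axis_accuracy(n_correct: int, n_unsafe: int) -> float:
    """Fine-grained diagnosis: fraction of unsafe trajectories whose
    Risk Source / Failure Mode / Harm label was attributed correctly."""
    return n_correct / n_unsafe

# Example with hypothetical counts (not from the paper):
print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

The same `per_axis_accuracy` function is applied three times, once per taxonomy axis, over the unsafe subset only.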

5. Illustrative Scenarios

Three cases typify the ATBench structure:

  • Prompt-Injection Safe Case: An indirect prompt injection in a tool's "notes" field instructs the agent to bypass authorization. The agent detects the injection and refuses, so the trajectory is labeled "Safe."
  • Prompt-Injection Unsafe Case: A "SYSTEM OVERRIDE" embedded in a screenshot causes goal drift; the agent sends an unauthorized WhatsApp message. The trajectory is labeled "Unsafe" with the triplet (Indirect Prompt Injection, Tool Misuse in Context, Functional & Opportunity Harm).
  • Unseen-Tool Corruption Case: A charging-station API returns corrupted instructions; the agent fails to validate the feedback and creates an unauthorized webhook, yielding (Corrupted Tool Feedback, Instruction for Harmful/Illegal Activity, Public Service & Resource Harm).

6. Baseline Performance and Comparative Results

ATBench is used to evaluate models such as AgentDoG variants, general LLMs, and specialized guard systems (R-Judge, ASSE-Safety, ShieldAgent, PolyGuard):

| Model | Accuracy | Precision | Recall | F1 | Risk Src Acc | Fail Mode Acc | Harm Acc |
|---|---|---|---|---|---|---|---|
| AgentDoG-Qwen3-4B | 92.8% | 90.5% | 95.6% | 93.0% | 82.0% | 32.4% | 58.4% |
| ShieldAgent | 70.9% | – | – | – | – | – | – |
| PolyGuard | 77.0% | – | – | – | – | – | – |
| Gemini 3 Pro | 36.8% | – | – | – | – | – | – |
| GPT-5.2 | 30.8% | – | – | – | – | – | – |

AgentDoG-Qwen3-4B demonstrates state-of-the-art results, substantially outperforming competing guard models in both binary safety detection and fine-grained risk attribution.

7. Impact, Applications, and Research Directions

ATBench advances AI agent safety evaluation by addressing four main fronts:

  • Transitioning from single-turn moderation to complex, tool-augmented, multi-turn trajectories.
  • Employing a three-axis taxonomy to unify safety and security risk perspectives.
  • Enforcing an unseen-tool split to measure model generalization.
  • Requiring fine-grained risk attribution rather than binary verdicts.

Practical Applications:

  • Benchmarking diagnostic guardrails capable of root cause analysis across tool executions.
  • Supporting interpretable monitors that translate taxonomy labels into alignment or RL corrections.
  • Facilitating evaluation for multi-agent and multimodal systems with richer environmental interactions.

ATBench provides a rigorous, reproducible baseline for both safety detection and explanatory risk diagnosis. It is foundational for the development, deployment, and alignment of safer and more transparent agentic systems in real-world environments (Liu et al., 26 Jan 2026).
