- The paper introduces a taxonomy-driven, long-horizon safety benchmark that uncovers subtle, trajectory-level risks in LLM-based agents.
- It integrates a heterogeneous tool ecosystem with scenario synthesis and rigorous human audits to ensure realistic risk evaluation.
- Experimental results show advanced models struggle with fine-grained diagnosis of risk sources and failure modes, highlighting new challenges.
ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
Motivation for Trajectory-Level Safety Evaluation
As LLM-based agents increasingly operate in long-horizon, interactive settings with complex tool use and external observations, safety risks are no longer adequately captured by prompt-level or response-level evaluation. Safety failures frequently emerge gradually throughout extended multi-turn interactions—as a result of compounding reasoning errors, unsafe tool invocation, misinterpreted user/environmental cues, or exploitation of carried-over state and permissions. The paradigm shift from isolated queries to persistent agent trajectories necessitates comprehensive benchmarks capable of revealing both overt and subtle safety risks, as well as providing the granularity required for causal diagnosis.
Limitations of Prior Benchmarks
Previous trajectory-level agent safety benchmarks suffer from key limitations:
- Insufficient interaction diversity: Restricted tool ecosystems and narrow scenario templates lead to benchmarks that fail to realistically stress agent capabilities.
- Limited observability and interpretability: Coarse binary labels obscure the mechanisms underlying safety failures, impeding root-cause analysis.
- Lack of long-horizon realism: Short, synthetic, or purely local scenarios underestimate the prevalence of delayed and context-dependent risks.
ATBench Construction and Methodology
ATBench introduces a structured framework to address these deficits by maximizing scenario diversity while retaining realism through a carefully engineered taxonomy-driven generation process and quality control pipeline.
Three-Dimensional Safety Taxonomy
Central to the benchmark is a hierarchical, three-axis taxonomy:
- Risk Source: Categorizes origins of unsafe conditions—user input (including direct/jailbreak instructions and prompt injection), environmental observations (such as indirect injections, unreliable or misleading information), external tools/APIs (including tool description injection, malicious tool execution, corrupted feedback), and internal agent failures (hallucination, flawed reasoning).
- Failure Mode: Annotates how risks manifest—spanning behavioral anomalies (over-privileged action, procedural deviation, insecure interactions, improper tool use) and output-content failures (harmful generation, unauthorized disclosure, misinformation).
- Real-World Harm: Captures downstream impact—ranging from privacy/privacy and financial loss to security, physical, societal, or opportunity harms.
This taxonomy allows for controlled sampling of risk types, ensuring comprehensive coverage of the agentic risk landscape and supporting fine-grained diagnostic evaluation.
Figure 2: Overview of the three orthogonal dimensions of the agentic safety taxonomy guiding ATBench data generation and diagnosis.
The scenario generator draws upon a large, normalized, and deduplicated pool of over 2,000 tools. This pool combines high-fidelity APIs from ToolBench, ToolAlpaca, and simulated resources specifically designed to expose underrepresented failure and risk regions.
Taxonomy-Guided Data Generation Pipeline
The data generation engine instantiates the taxonomy in concrete multi-step agent trajectories via planner-based scenario synthesis, which sequentially injects risk triggers, invokes diverse tools, simulates environmental feedback, and records agent responses in paired safe/unsafe variants. The protocol specifically supports long-context dependencies, enabling risks to be introduced early and only realized several steps later.
Figure 3: Data generation engine illustrating taxonomy-driven sampling, risk injection, sequential simulation, and quality filtering in ATBench.
Filtering and Human Audit
Rule-based checks (type/schema validation, API consistency) and LLM-based plausibility filters remove structurally invalid or unrealistic artifacts. A thorough, five-reviewer human audit further corrects both binary and fine-grained misannotations, resulting in 503 safe and 497 unsafe trajectories, each averaging 9 turns and 4K tokens—significantly longer and richer than prior datasets.
Experimental Results and Analysis
Safety Assessment Tasks
ATBench evaluates both binary trajectory-level safety classification (safe/unsafe) and fine-grained diagnosis across the three taxonomy axes for unsafe cases. Experiments consider proprietary frontier models (GPT-5.4, Gemini-3.1-Pro), strong open-source LLMs (Qwen3.5, Llama3.1-8B), and specialized guard models (LlamaGuard, ShieldAgent, AgentDoG-Qwen3).
Difficulty Assessment
All models demonstrate markedly reduced performance on ATBench relative to previous benchmarks such as R-Judge or ASSE-Safety. For instance, GPT-5.4 achieves only 76.7% F1 on binary safe/unsafe classification and 33.6% accuracy on risk source diagnosis, with specialized trajectory-trained guard models (AgentDoG-Qwen3-4B) attaining 46.8% risk source and 40.6% real-world harm identification (Figure 4, 5). Meanwhile, general-purpose open-source models consistently score lower still.
Figure 3: Model trajectory-level accuracy is lower on ATBench than previous benchmarks, showcasing increased difficulty and diversity.
Figure 6: Breakdown of per-leaf-category accuracy for four top models along the risk source, failure mode, and harm axes; high variance highlights unsolved hard cases.
Fine-Grained Diagnosis Bottlenecks
All evaluators reliably detect that a trajectory is unsafe when the final consequence is visible, but struggle to accurately identify the underlying risk source and failure mode. Tool-mediated risks, tool description injection, and indirect prompt injections are especially challenging; even strong models typically over-assign more tractable categories (e.g., user-driven or output failures).
Figure 4: Case studies: All models detect a trajectory as unsafe but consistently mislabel the causal category, illustrating diagnostic limitations.
Category-wise breakdowns reveal the greatest accuracy for direct, user-driven failures (e.g., explicit malicious instruction) and the lowest for tool-centric and context-delayed vulnerabilities.
Practical and Theoretical Implications
ATBench's comprehensive taxonomy, long-context scenarios, and transparent diagnosis structure enable several crucial lines of research:
- Robustness evaluation: By exposing diverse and obscured safety risks under realistic settings, the benchmark supports robust measurement of agent generalization and error detection capabilities.
- Guardrail development: Detailed diagnostic feedback informs the next generation of guard and intervention models, especially for subtle and temporally separated risk emergence.
- Root cause analysis: Taxonomy-aligned labels and difficult instances allow investigation into the provenance and propagation of agent failures, guiding architectural and training interventions towards better causal understanding.
- Benchmarking advances: The pronounced drop in performance relative to previous datasets provides a tangible upper bound for the current state-of-the-art, exposing clear unsolved challenges in the discipline.
Future Directions
The framework outlined by ATBench creates a foundation for meaningful extensions:
- Multi-label and multi-causal cases: Moving beyond single-label annotation to model compounded or interacting failures.
- Multilingual and multimodal settings: Expanding scenario diversity to cross-linguistic and visual/task-embodied contexts.
- Active Benchmark Evolution: Incorporating adversarial and dynamic test-time generation to further probe and expand the failure surface.
Conclusion
ATBench provides a rigorously designed, richly annotated benchmark for trajectory-level safety evaluation and diagnosis in LLM-based agents. By integrating a diverse toolset, a principled taxonomy for risk/failure/harm, and a combination of synthetic and human-in-the-loop validation, the benchmark reveals substantial deficits in current models—especially for attribution and real-world harm detection. Its taxonomy-structured dataset enables orthogonal and cross-slice failure analysis, setting a new standard for evaluating agentic safety in realistic, long-horizon deployments. ATBench thus constitutes both a practical suite for benchmarking guard systems and a research testbed for the study of agentic risk understanding and mitigation.
Figure 5: ATBench overview: taxonomy, trajectory-level safety judgment, and a prompt-injection failure case.