- The paper introduces TraceSafe-Bench, a novel trace-level benchmark to assess LLM guardrails on multi-step tool-calling pipelines.
- The methodology leverages deterministic trace mutations and ensemble agent models to simulate twelve distinct risk categories such as prompt injection and privacy leakage.
- Results show that structural parsing capability outperforms model scale in detecting safety anomalies, highlighting the need for context-rich guardrails.
Motivation and Problem Definition
Autonomous agentic LLMs increasingly operate via complex, multi-step tool-calling pipelines, generating intermediate structured traces that expand the attack surface beyond simple prompt-response interactions. Prior safety benchmarks focus on end-to-end agent robustness or moderation of final outputs, leaving the efficacy of guardrails in controlling mid-trajectory risks largely uncharacterized. The paper introduces TraceSafe-Bench, the first static, trace-level benchmark designed to systematically assess guard models applied in agentic workflows. It targets 12 distinct risk categories, spanning prompt injection, privacy leakage, hallucination (schema/argument errors), and interface inconsistencies. The construction pipeline leverages a Benign-to-Harmful Editing methodology to produce over a thousand precisely annotated execution traces.
Figure 1: The threat landscape in tool-calling pipelines and the TraceSafe-Bench construction workflow, illustrating the structural annotation and mutation of benign traces.
Benchmark Construction Methodology
TraceSafe-Bench starts from rigorously curated benign seeds sourced from the multi-step split of the Berkeley Function Calling Leaderboard (BFCL), ensuring ecological validity and verifiable execution. Diverse traces are generated with ensemble agentic models (Gemini, Qwen, ToolACE, Ministral, GPT-5-mini), filtered for perfect execution before mutation. The Benign-to-Harmful Editing approach applies deterministic, code-driven mutations across user queries, tool definitions, and intermediate traces for each risk category. Critical constraints (e.g., mutability and consistency) are enforced via Check-and-Mutate logic, providing localized ground truth and avoiding free-form LLM rewriting artifacts.
Figure 2: Examples illustrating mutations for each category, including prompt injection, privacy leaks, hallucinated arguments, and interface inconsistencies.
The taxonomy encompasses twelve risk modes across four domains:
- Prompt Injection: Attacks via tool descriptions or returned outputs containing adversarial instructions.
- Privacy Leakage: Unauthorized propagation of PII, API keys, or internal data to external endpoints.
- Hallucination: Erroneous argument invention, schema deviation, and missing type hints leading to ungrounded actions.
- Interface Inconsistencies: Version conflicts and semantically contradicting tool descriptions.
Experimental Evaluation Design
The evaluation settings encompass binary (safe/unsafe) and multi-class classification (coarse/fine-grained) across 13 general purpose LLMs and 7 specialized guardrails, including open-weight and proprietary models. Four classification modes assess both intrinsic safety alignment and taxonomy-guided risk detection: binary (with/without schema), 5-class coarse-grained, and 13-class fine-grained categorization. Metrics focus on rejection rates and balanced accuracy between unsafe and benign classes.
Core Results and Analysis
Guardrail efficacy displays strong, architecture-driven divergence:
- Structural Bottleneck: Fine-grained trace safety correlates with structural parsing capability (ρ=0.79 against structured hallucination benchmarks), not semantic alignment or jailbreak robustness.
- Architecture over Scale: Code-heavy and structured pre-training drive detection; model size alone is not predictive.
- Temporal Stability: Accuracy remains robust or improves across long trajectories, with dynamic execution context making behavioral anomalies more salient.
Binary classification is subject to instructionally-induced bias (over-refusal in general LLMs, over-acceptance in specialized models), while fine-grained taxonomy enables systematic anomaly pinpointing and improved calibration. Explicit vulnerabilities (especially prompt injection or privacy leakage in output-adjacent steps) are detected with high accuracy by top models, while subtle interface inconsistencies remain inadequately guarded.
Figure 3: Scatter plots of Pearson correlation between TraceSafe-Bench performance and various benchmarks; strongest correlation is observed with structured data parsing tasks.
Figure 4: Performance trends as a function of trace length and tool-calling steps, showing stable or improved accuracy on longer trajectories.
Figure 5: Confusion heatmap from fine-grained multi-class evaluation. Detection failures overwhelmingly default to the benign class, indicating a calibration challenge for operational anomalies.
Practical and Theoretical Implications
TraceSafe-Bench catalogues agentic failures that evade classical alignment and post-hoc moderation, underscoring the necessity of joint optimization for structural reasoning and risk detection in deployed safeguard models. Structural competence supersedes semantic safety; robust guardrails must be explicitly trained or adapted to parse and interpret dense, nested execution traces, beyond pattern-matching or moral alignment.
Model architecture and input modality now dominate safety performance. The implications of these results extend toward practical real-world agent deployment, especially in sensitive domains (finance, healthcare), requiring structure-aware, context-rich safety systems. The calibration issues detected signal the need to refine both negative and positive classes and improve operational anomaly detection.
Future Directions
Research should focus on dynamic, co-evolutionary guardrails capable of monitoring and intervening in ongoing agentic workflows, incorporating real-time feedback and trace-level auditing. Continuous expansion of risk taxonomies and integration with protocol-specific benchmarks will ensure coverage against emerging attack vectors. There is an evident need for structure-aligned guardrail pre-training and evaluation on authentic tool-use pipelines, leveraging agentic traces as first-class citizens in safety research.
Conclusion
TraceSafe-Bench delivers the first rigorously annotated, trace-level benchmark for evaluating agentic LLM guardrails. The study demonstrates that structural data competence, not just semantic alignment, is the primary bottleneck in intercepting malicious tool-calling actions. Explicit taxonomy, architectural bias, and context dynamics guide the design of future guardrails tasked with securing autonomous LLM workflows. The systematic insights provided motivate a shift toward proactive, structure-aware safety mechanisms for next-generation agentic AI systems.
(2604.07223)