TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Published 8 Apr 2026 in cs.CR, cs.AI, cs.CL, cs.LG, and cs.SE | (2604.07223v1)

Abstract: As LLMs evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($ρ=0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces TraceSafe-Bench, a novel trace-level benchmark to assess LLM guardrails on multi-step tool-calling pipelines.
The methodology leverages deterministic trace mutations and ensemble agent models to simulate twelve distinct risk categories such as prompt injection and privacy leakage.
Results show that structural parsing capability outperforms model scale in detecting safety anomalies, highlighting the need for context-rich guardrails.

TraceSafe: Benchmarking LLM Guardrails on Multi-Step Tool-Calling Trajectories

Motivation and Problem Definition

Autonomous agentic LLMs increasingly operate via complex, multi-step tool-calling pipelines, generating intermediate structured traces that expand the attack surface beyond simple prompt-response interactions. Prior safety benchmarks focus on end-to-end agent robustness or moderation of final outputs, leaving the efficacy of guardrails in controlling mid-trajectory risks largely uncharacterized. The paper introduces TraceSafe-Bench, the first static, trace-level benchmark designed to systematically assess guard models applied in agentic workflows. It targets 12 distinct risk categories, spanning prompt injection, privacy leakage, hallucination (schema/argument errors), and interface inconsistencies. The construction pipeline leverages a Benign-to-Harmful Editing methodology to produce over a thousand precisely annotated execution traces.

Figure 1: The threat landscape in tool-calling pipelines and the TraceSafe-Bench construction workflow, illustrating the structural annotation and mutation of benign traces.

Benchmark Construction Methodology

TraceSafe-Bench starts from rigorously curated benign seeds sourced from the multi-step split of the Berkeley Function Calling Leaderboard (BFCL), ensuring ecological validity and verifiable execution. Diverse traces are generated with ensemble agentic models (Gemini, Qwen, ToolACE, Ministral, GPT-5-mini), filtered for perfect execution before mutation. The Benign-to-Harmful Editing approach applies deterministic, code-driven mutations across user queries, tool definitions, and intermediate traces for each risk category. Critical constraints (e.g., mutability and consistency) are enforced via Check-and-Mutate logic, providing localized ground truth and avoiding free-form LLM rewriting artifacts.

Figure 2: Examples illustrating mutations for each category, including prompt injection, privacy leaks, hallucinated arguments, and interface inconsistencies.

The taxonomy encompasses twelve risk modes across four domains:

Prompt Injection: Attacks via tool descriptions or returned outputs containing adversarial instructions.
Privacy Leakage: Unauthorized propagation of PII, API keys, or internal data to external endpoints.
Hallucination: Erroneous argument invention, schema deviation, and missing type hints leading to ungrounded actions.
Interface Inconsistencies: Version conflicts and semantically contradicting tool descriptions.

Experimental Evaluation Design

The evaluation settings encompass binary (safe/unsafe) and multi-class classification (coarse/fine-grained) across 13 general purpose LLMs and 7 specialized guardrails, including open-weight and proprietary models. Four classification modes assess both intrinsic safety alignment and taxonomy-guided risk detection: binary (with/without schema), 5-class coarse-grained, and 13-class fine-grained categorization. Metrics focus on rejection rates and balanced accuracy between unsafe and benign classes.

Core Results and Analysis

Guardrail efficacy displays strong, architecture-driven divergence:

Structural Bottleneck: Fine-grained trace safety correlates with structural parsing capability ( $\rho=0.79$ against structured hallucination benchmarks), not semantic alignment or jailbreak robustness.
Architecture over Scale: Code-heavy and structured pre-training drive detection; model size alone is not predictive.
Temporal Stability: Accuracy remains robust or improves across long trajectories, with dynamic execution context making behavioral anomalies more salient.

Binary classification is subject to instructionally-induced bias (over-refusal in general LLMs, over-acceptance in specialized models), while fine-grained taxonomy enables systematic anomaly pinpointing and improved calibration. Explicit vulnerabilities (especially prompt injection or privacy leakage in output-adjacent steps) are detected with high accuracy by top models, while subtle interface inconsistencies remain inadequately guarded.

Figure 3: Scatter plots of Pearson correlation between TraceSafe-Bench performance and various benchmarks; strongest correlation is observed with structured data parsing tasks.

Figure 4: Performance trends as a function of trace length and tool-calling steps, showing stable or improved accuracy on longer trajectories.

Figure 5: Confusion heatmap from fine-grained multi-class evaluation. Detection failures overwhelmingly default to the benign class, indicating a calibration challenge for operational anomalies.

Practical and Theoretical Implications

TraceSafe-Bench catalogues agentic failures that evade classical alignment and post-hoc moderation, underscoring the necessity of joint optimization for structural reasoning and risk detection in deployed safeguard models. Structural competence supersedes semantic safety; robust guardrails must be explicitly trained or adapted to parse and interpret dense, nested execution traces, beyond pattern-matching or moral alignment.

Model architecture and input modality now dominate safety performance. The implications of these results extend toward practical real-world agent deployment, especially in sensitive domains (finance, healthcare), requiring structure-aware, context-rich safety systems. The calibration issues detected signal the need to refine both negative and positive classes and improve operational anomaly detection.

Future Directions

Research should focus on dynamic, co-evolutionary guardrails capable of monitoring and intervening in ongoing agentic workflows, incorporating real-time feedback and trace-level auditing. Continuous expansion of risk taxonomies and integration with protocol-specific benchmarks will ensure coverage against emerging attack vectors. There is an evident need for structure-aligned guardrail pre-training and evaluation on authentic tool-use pipelines, leveraging agentic traces as first-class citizens in safety research.

Conclusion

TraceSafe-Bench delivers the first rigorously annotated, trace-level benchmark for evaluating agentic LLM guardrails. The study demonstrates that structural data competence, not just semantic alignment, is the primary bottleneck in intercepting malicious tool-calling actions. Explicit taxonomy, architectural bias, and context dynamics guide the design of future guardrails tasked with securing autonomous LLM workflows. The systematic insights provided motivate a shift toward proactive, structure-aware safety mechanisms for next-generation agentic AI systems.

(2604.07223)

Markdown Report Issue