ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems

Published 2 Apr 2026 in cs.SE and cs.AI | (2604.01508v1)

Abstract: Tool-using agents often fail for operational reasons even when language understanding is strong. Common causes include invalid arguments, interface drift, weak recovery, and inefficient retry behavior. We introduce ToolMisuseBench, an offline deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. The benchmark covers CRUD, retrieval, file, and scheduling environments with replayable fault injection. It reports success, invalid call behavior, policy violations, recovery quality, and budgeted efficiency. We release a public dataset with 6800 tasks and a reproducible evaluation pipeline. Baseline results show fault-specific recovery gains for schema-aware methods, while overall success remains limited under the released authorization and hard failure settings.


Authors (2)

Summary

The paper introduces ToolMisuseBench, a deterministic benchmark isolating five failure modes in agentic systems to assess tool misuse and recovery.
It employs structured, replayable task episodes with controlled fault activations, enabling precise cross-agent comparisons and ablation studies.
Empirical results highlight fault-mode sensitivity and limited recovery performance, underscoring the need for more robust, policy-driven repair strategies.

ToolMisuseBench: Deterministic Benchmarking for Tool Misuse and Recovery in Agentic Systems

Motivation and Benchmark Design

"ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems" (2604.01508) addresses the persistent reliability gap in tool-integrated agentic systems, where execution failures often stem from malformed tool calls, schema drift, and brittle recovery behavior rather than from language understanding alone. The authors introduce ToolMisuseBench, a deterministic, replayable evaluation suite that attributes and quantifies operational failures and recovery performance across simulated tool-use environments. Unlike broader capability benchmarks, ToolMisuseBench isolates five explicit failure modes: schema drift, rate limits, timeouts, authorization constraints, and adversarial error rewriting. This isolation enables fine-grained inspection and controlled difficulty adjustment across CRUD, retrieval, file, and scheduling scenarios.

The benchmark is formulated as a set of structured tasks, each combining an explicit instruction, a tool interface schema, an initial state, success criteria, and a fault plan. Deterministic offline simulation makes every task episode replayable, which allows exact cross-agent comparison and longitudinal ablation studies. Evaluations are strictly budgeted by steps, tool calls, and retries. Success is adjudicated by state- and transcript-based checks, and every episode's fault context and outcomes are fully auditable.
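As a rough illustration of the task structure described above, the five components and the three budgets could be modeled as plain records. This is a minimal sketch, not the package's actual data model: every class, field, and function name here (`Task`, `Budget`, `FaultSpec`, `state_check`, `within_budget`) is a hypothetical stand-in, and the success check covers only the state-based half of adjudication.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Budget:
    """Hard caps on one episode (hypothetical field names)."""
    max_steps: int
    max_tool_calls: int
    max_retries: int


@dataclass(frozen=True)
class FaultSpec:
    """One planned fault: which mode fires on which tool call."""
    mode: str      # e.g. "schema_drift", "rate_limit", "timeout"
    at_call: int   # zero-based index of the faulted tool call


@dataclass
class Task:
    """Hypothetical task record mirroring the five components in the text."""
    instruction: str
    tool_schema: dict[str, Any]
    initial_state: dict[str, Any]
    success_criteria: dict[str, Any]  # expected key/value pairs in the final state
    fault_plan: list[FaultSpec] = field(default_factory=list)


def state_check(task: Task, final_state: dict[str, Any]) -> bool:
    """State-based success: every expected key must match the final state."""
    return all(final_state.get(k) == v for k, v in task.success_criteria.items())


def within_budget(budget: Budget, steps: int, tool_calls: int, retries: int) -> bool:
    """True only if all three counters stay at or below their caps."""
    return (
        steps <= budget.max_steps
        and tool_calls <= budget.max_tool_calls
        and retries <= budget.max_retries
    )
```

Keeping the fault plan inside the task record, rather than drawing faults at run time, is one way to make an episode replayable from the task alone.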

Implementation and Dataset

The implementation is released as a Python package providing a unified pipeline for data generation, agent evaluation, and artifact creation. The core environments implement deterministic simulators, with a tightly integrated fault engine that supports task-seeded, reproducible fault activations. Error traceability and explicit feedback under adversarial settings are prioritized, facilitating robust post hoc analysis of agent recovery strategies.
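One plausible way to obtain task-seeded, reproducible fault activations is to derive a private random generator from each task identifier. The sketch below is an assumption about how such a fault engine might work, not the released implementation: `task_rng`, `plan_faults`, the `fault_rate` parameter, and the mode identifiers are all hypothetical.

```python
import hashlib
import random

# Hypothetical identifiers for the five failure modes named in the paper.
FAULT_MODES = (
    "schema_drift",
    "rate_limit",
    "timeout",
    "authorization",
    "error_rewrite",
)


def task_rng(task_id: str, base_seed: int = 0) -> random.Random:
    """Derive a private RNG from the task id.

    hashlib is used instead of the built-in hash(), because str hashes
    are salted per process and would break cross-run replay.
    """
    digest = hashlib.sha256(f"{base_seed}:{task_id}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def plan_faults(task_id: str, n_calls: int, fault_rate: float = 0.3, base_seed: int = 0):
    """Return one entry per tool call: a fault mode, or None for no fault."""
    rng = task_rng(task_id, base_seed)
    plan = []
    for _ in range(n_calls):
        if rng.random() < fault_rate:
            plan.append(rng.choice(FAULT_MODES))
        else:
            plan.append(None)
    return plan
```

Because the generator depends only on the task identifier and a base seed, calling `plan_faults` twice with the same arguments yields the same plan, regardless of process or global random state.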

The released dataset comprises 6,800 tasks, split across train, development, and public test sets and balanced among the target domains. The dataset design enforces coherence among instruction, initial state, and fault plan, while maximizing syntactic and structural diversity, measured by uniqueness metrics on instructions and states. Versioning and split manifests ensure reproducibility and allow extension with internal splits that share the same semantics. Quality assurance tools and the manifests support verification and auditing, mitigating the risk of degenerate or duplicated tasks.

Baselines and Experimental Protocol

ToolMisuseBench establishes three baseline agents: a deterministic heuristic, a schema error-repair agent, and a policy-aware agent built on top of schema repair. All agents are evaluated under uniform, strict budget and visibility constraints, so that behavioral comparisons are not confounded by extrinsic factors. The evaluation protocol outputs aggregate and per-fault diagnostic metrics, including success rate, invalid call incidence, policy violations, recovery rate, tool call efficiency, and catastrophic failure propensity. Critically, the setup emphasizes fault-conditioned analysis, which prevents aggregate metrics from masking regressions in individual failure modes.
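The fault-conditioned analysis described above amounts to grouping episode records by fault mode before computing rates. The following sketch shows the idea under stated assumptions: the record keys (`fault`, `success`, `invalid_calls`, `tool_calls`, `fault_hit`, `recovered`) are hypothetical, and recovery rate is assumed here to mean the fraction of fault-hit episodes that recovered, which may differ from the paper's exact definition.

```python
from collections import defaultdict


def per_fault_metrics(episodes):
    """Group episode records by fault mode and compute simple rates.

    Each episode is a dict with hypothetical keys:
      fault (str), success (bool), invalid_calls (int),
      tool_calls (int), fault_hit (bool), recovered (bool).
    """
    groups = defaultdict(list)
    for ep in episodes:
        groups[ep["fault"]].append(ep)

    report = {}
    for fault, eps in groups.items():
        hit = [e for e in eps if e["fault_hit"]]
        total_calls = sum(e["tool_calls"] for e in eps)
        report[fault] = {
            "n": len(eps),
            "success_rate": sum(e["success"] for e in eps) / len(eps),
            # Invalid calls as a share of all tool calls in this group.
            "invalid_call_rate": (
                sum(e["invalid_calls"] for e in eps) / total_calls
                if total_calls
                else 0.0
            ),
            # Assumed definition: recovered episodes / fault-hit episodes.
            "recovery_rate": (
                sum(e["recovered"] for e in hit) / len(hit) if hit else None
            ),
        }
    return report
```

Reporting one row per fault mode makes a case like "0.50 recovery on timeouts, 0.00 on rate limits" visible even when the aggregate success rate is identical across agents.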

Empirical Results

The primary empirical findings are:

Overall task success is 0.25 for all baselines on the public test split, despite their differing tool call counts and recovery logic.
Recovery is fault-mode sensitive: schema-aware and policy-aware agents achieve recovery rates of 0.50 on the timeout and schema drift subsets, while all baselines fail under rate limit and authorization constraints, with 0.00 success.
The policy-aware agent does not outperform schema repair in the released configuration, indicating narrow policy coverage.
Increased policy or repair sophistication leads to elevated policy violation rates on schema drift tasks, suggesting overcorrection or inadequate context-aware repair strategies.
Tool call efficiency does not correlate with reliability: the heuristic agent uses fewer calls but does not improve overall success, reinforcing that robustness is a function of recovery quality under hard budget constraints rather than call minimization alone.
Budgeted success curves are universally flat (AUC 0.25); cap scaling does not alleviate the core limitations imposed by current agent design and failure mode coverage.

These results underscore that naive repair and policy augmentation deliver fault-class-specific gains but leave operational reliability severely constrained under real-world-inspired failure mixtures, especially when hard authorization and rate limiting are present.
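To make the flat budgeted success curve concrete, the sketch below computes success at several tool-call caps and a normalized area under the resulting curve. The definitions are assumptions for illustration, since this summary does not state the paper's exact formula: success at a cap is taken as the fraction of tasks solved within that many tool calls, and the AUC is the trapezoidal area divided by the cap range. The function names and the sample data are hypothetical.

```python
def success_at_cap(calls_to_success, cap):
    """Fraction of tasks solved within `cap` tool calls.

    calls_to_success holds, per task, the number of calls used when the
    task succeeded, or None if the task never succeeded.
    """
    solved = sum(1 for c in calls_to_success if c is not None and c <= cap)
    return solved / len(calls_to_success)


def normalized_auc(caps, values):
    """Trapezoidal area under (caps, values), divided by the cap range."""
    area = 0.0
    for i in range(1, len(caps)):
        area += (caps[i] - caps[i - 1]) * (values[i] + values[i - 1]) / 2.0
    return area / (caps[-1] - caps[0])


# Illustrative data only: one of four tasks succeeds, using two calls.
calls = [2, None, None, None]
caps = [2, 4, 8, 16]
curve = [success_at_cap(calls, c) for c in caps]
auc = normalized_auc(caps, curve)
```

Under these assumptions, a curve that sits at the same value for every cap has a normalized AUC equal to that value, which is consistent with a flat curve at 0.25 yielding an AUC of 0.25: raising the caps adds no newly solved tasks.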

Limitations and Directions for Future Research

The benchmark's strengths are its deterministic replayability, transparent episode auditing, and precise failure attribution. However, the synthetic environments abstract away certain production complexities, specifically real-world monetary costs, network and latency jitter, and the full semantic richness of deployed tool APIs. Expansion to longer-horizon workflows and integration of external cost models would further increase ecological validity.

Baseline diversity remains a limitation; more sophisticated learned agents and adaptive model-based approaches should be integrated. The policy-aware agent, in particular, is not a comprehensive policy reasoning system but a lightweight safety heuristic. Future work should explore more robust policy-aligned planning, hierarchical recovery strategies, and selective fallback policies that adapt under compound or intractable error states.

As tool ecosystems evolve, new failure modes and schema variants should be added, and the fault taxonomy must be continuously updated to capture emerging operational risks. The benchmark is extensible by construction to accommodate such growth.

Practical and Theoretical Implications

ToolMisuseBench advances methodological rigor in empirical agent reliability research by providing deterministic, failure-attributed, and budget-aware benchmarking. Practically, it offers a foundation for iterative development and controlled comparison of agentic repair, retry, and policy enforcement strategies. The findings indicate that achieving robust tool use requires integrated planning frameworks capable of contextually aware recovery, rather than generic retry or schema-correction heuristics. Theoretically, the benchmark supports studies into the joint optimization of correctness, safety, and budget discipline, and provides a framework for investigating how enforcement infrastructure (e.g., guardrails) can mediate trade-offs in tool-orchestrated settings.

The open release of the dataset and pipeline lowers barriers for replication, extension, and fair third-party evaluation, offering the potential to standardize protocol in reliability-centric agent research.

Conclusion

ToolMisuseBench (2604.01508) establishes a reproducible, deterministic benchmark for studying tool misuse and recovery in agentic systems under explicitly controlled budgets and fault plans. The dataset, implementation, and baseline analyses provided deliver a robust platform for future research into operational reliability, policy-enforced robustness, and adaptive recovery mechanisms in tool-integrated language agents. Subsequent developments in agent design should build upon this reproducible infrastructure, focusing on enhanced fault tolerance, adaptive policy integration, and the ability to generalize recovery and planning strategies to broader and more heterogeneous operational contexts.
