TRAJECT-Bench: LLM Tool-Use Benchmark
- The paper introduces trajectory-level metrics that reveal novel failure modes and scaling bottlenecks in LLM agents' multi-step tool use.
- It evaluates 1,228 production APIs across ten domains using JSON-formatted trajectories for both parallel and sequential tool invocations.
- The framework leverages ReAct with dynamic retrieval to benchmark performance on simple and hard queries while ensuring parameter and order fidelity.
TRAJECT-Bench is a large-scale, trajectory-aware benchmarking suite for evaluating the multi-step tool-use competence of LLM agents in production-relevant environments. It addresses the limitations of previous evaluation schemes that focus solely on final-answer correctness by explicitly measuring agentic performance over fine-grained tool usage trajectories: selection, argument/parameterization, and order/dependency satisfaction. TRAJECT-Bench comprises 1,228 executable APIs sourced from RapidAPI, spanning ten practical domains and paired with 5,670 user queries of varying difficulty, each requiring complex parallel or interdependent tool calls. It introduces trajectory-level metrics that reveal novel failure modes and scaling bottlenecks in contemporary LLM agentic tool use.
1. Motivation and Benchmark Architecture
Conventional benchmarks for LLM agent tool use generally emphasize final answers, typically reporting pass rates or win rates. These methodologies neglect the compositional and procedural aspects of agentic tool use—specifically, how agents select, configure, and orchestrate multiple tools to fulfill complex tasks. In many real-world contexts, a trajectory may comprise 3–10+ chained API calls with strict interdependencies, where any local error can be catastrophic for task success.
TRAJECT-Bench fills this methodological gap by providing both a comprehensive tool-suite and a trajectory-centric evaluation harness. At test time, the system presents queries to LLMs along with a specified tool context, and instructs models to construct a stepwise tool-call trajectory in JSON format. Each predicted tool call is executed against the real API, yielding outputs that are compared to gold-standard trajectories at multiple granularities. Optionally, a reference LLM (Claude-4) adjudicates both final-answer and trajectory satisfaction when gold traces are incomplete (He et al., 6 Oct 2025).
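As a concrete illustration, the minimal Python sketch below mirrors this evaluation flow: the model emits a stepwise JSON trajectory, each call is executed, and the result is compared against a gold trace. The tool names, argument schema, and the `execute_call` helper are hypothetical placeholders, not the benchmark's actual harness.

```python
import json

# Hypothetical predicted trajectory in the stepwise JSON format the models
# are asked to emit: one entry per tool call, with named arguments.
predicted = json.loads("""
[
  {"tool": "get_weather_forecast", "args": {"city": "Paris", "days": 3}},
  {"tool": "search_flights", "args": {"origin": "JFK", "destination": "CDG"}}
]
""")

# Gold-standard trajectory used for trajectory-level comparison.
gold = [
    {"tool": "get_weather_forecast", "args": {"city": "Paris", "days": 3}},
    {"tool": "search_flights", "args": {"origin": "JFK", "destination": "CDG"}},
]

def execute_call(call):
    """Placeholder for executing one predicted call against the live API."""
    return {"status": "ok", "tool": call["tool"]}

# Execute each predicted call, then compare tool choices against the gold trace.
observations = [execute_call(call) for call in predicted]
tools_match = [p["tool"] == g["tool"] for p, g in zip(predicted, gold)]
print(observations, tools_match)
```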
2. Task Construction and Tool Suite
TRAJECT-Bench curates 1,228 production-style APIs, eliminating duplicates and low-information tools by fusing documentation and observed I/O. Parameter complexity is systematically controlled—tools often require multiple, structured fields to stress-test agentic reasoning. The APIs are categorized into ten domains: Travel, Mapping, Finance, Weather, E-commerce, News/Media, Gaming, Email, Education, and Music.
Tasks are formulated as natural-language queries mapped to these APIs, demanding multi-tool planning. For example, the "Airbnb listings: Prices and Availability by lat/lng" API requires precise specification of year, latitude, longitude, range, and month fields, with optional refinement for bedrooms and guest capacity. Tasks routinely involve both parallel trajectories (independent calls) and sequential chains (where each tool’s parameters may be dynamically bound to previous outputs).
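A hedged illustration of what a single call in such a trajectory might look like, using the parameter fields named above; the exact tool identifier, field names, and units are assumptions for the example, not the API's literal schema.

```python
# Illustrative tool call for the "Airbnb listings: Prices and Availability
# by lat/lng" API. Identifier, field names, and units are assumed; only the
# set of required/optional fields follows the description in the text.
airbnb_call = {
    "tool": "airbnb_listings_prices_availability_by_lat_lng",
    "args": {
        "year": 2025,
        "latitude": 48.8566,
        "longitude": 2.3522,
        "range": 500,            # search radius around the coordinates
        "month": 7,
        # Optional refinements mentioned in the task description:
        "bedrooms": 2,
        "maxGuestCapacity": 4,
    },
}
```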
3. Trajectory Synthesis and Query Design
TRAJECT-Bench operationalizes two trajectory structures:
- Parallel trajectories ($T^{\text{par}}$): Sets of unordered, independent tool invocations, where each input is fixed by the query: $T^{\text{par}} = \{(t_1, x_1), \ldots, (t_k, x_k)\}$.
- Sequential trajectories ($T^{\text{seq}}$): Ordered chains $(t_1, x_1) \to (t_2, x_2) \to \cdots \to (t_k, x_k)$, with each input $x_i$ possibly bound to outputs of the earlier calls $t_1, \ldots, t_{i-1}$.
For synthesis, an LLM is prompted to structure valid plans using either direct templates or manual graph-based chain construction. Chains encode explicit parameter bindings—for example, propagating country_id in finance or travel domains.
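The sketch below shows, under assumed schema conventions, how a sequential trajectory can encode such a binding (here a hypothetical `country_id` propagation) and how a harness might resolve it before execution; the `$steps[...]` reference syntax is illustrative only, not the benchmark's actual format.

```python
# Sketch of a sequential trajectory with an explicit parameter binding:
# step 2 consumes the country_id produced by step 1. Tool names and the
# "$steps[...]" binding syntax are illustrative assumptions.
sequential_trajectory = [
    {
        "step": 1,
        "tool": "lookup_country",
        "args": {"name": "Japan"},          # fixed input from the query
    },
    {
        "step": 2,
        "tool": "get_exchange_rate",
        "args": {"country_id": "$steps[1].output.country_id"},  # bound input
    },
]

def resolve_bindings(args, step_outputs):
    """Replace '$steps[i].output.<field>' references with earlier step outputs."""
    resolved = {}
    for key, value in args.items():
        if isinstance(value, str) and value.startswith("$steps["):
            step_idx = int(value[value.index("[") + 1 : value.index("]")])
            field = value.rsplit(".", 1)[-1]
            resolved[key] = step_outputs[step_idx][field]
        else:
            resolved[key] = value
    return resolved
```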
Each trajectory is paired with two types of queries: “Simple” (explicit mention of APIs/parameters) and “Hard” (colloquial, intent-obscured paraphrase), to rigorously evaluate both surface-level and latent intent matching. The dataset includes 2,000 simple and 2,000 hard queries for parallel, and 1,870 sequential queries (He et al., 6 Oct 2025).
4. Evaluation Metrics and Diagnostics
The suite introduces granular, trajectory-level metrics for comparing predicted and gold-standard tool-use paths:
- Final Accuracy (Acc): Binary correctness judgment of the final answer by the LLM judge (correct vs. incorrect)
- Tool Selection Correctness:
- Exact Match (EM): Does the predicted tool sequence exactly match gold?
- Inclusion: Fraction of gold tools included in prediction
- Argument/Parameter Usage: Proportion of fields in each call correctly matched
- Dependency/Order Satisfaction: Fraction of gold order relations preserved
- LLM-Judge Trajectory Satisfaction: 0–10 scale rating of how well the trajectory solves the query, when explicit gold traces are missing
All scores are averaged over test cases to derive aggregate performance statistics for agentic tool use (He et al., 6 Oct 2025).
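A minimal sketch of how these trajectory-level scores can be computed, assuming predicted and gold trajectories are lists of `{"tool": ..., "args": ...}` records; normalization details (for example, how argument values are compared) are simplifying assumptions rather than the paper's exact procedure.

```python
def exact_match(pred_tools, gold_tools):
    """EM: predicted tool sequence identical to the gold sequence."""
    return float(pred_tools == gold_tools)

def inclusion(pred_tools, gold_tools):
    """Fraction of gold tools that appear anywhere in the prediction."""
    return sum(t in pred_tools for t in gold_tools) / len(gold_tools)

def parameter_usage(pred_calls, gold_calls):
    """Average fraction of gold argument fields matched per aligned call."""
    scores = []
    for pred, gold in zip(pred_calls, gold_calls):
        gold_args = gold["args"]
        matched = sum(pred["args"].get(k) == v for k, v in gold_args.items())
        scores.append(matched / max(len(gold_args), 1))
    return sum(scores) / max(len(scores), 1)

def order_satisfaction(pred_tools, gold_order_pairs):
    """Fraction of gold (before, after) dependencies preserved in the prediction."""
    position = {tool: i for i, tool in enumerate(pred_tools)}
    satisfied = sum(
        a in position and b in position and position[a] < position[b]
        for a, b in gold_order_pairs
    )
    return satisfied / max(len(gold_order_pairs), 1)
```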
5. Experimental Framework
TRAJECT-Bench assesses ten contemporary LLMs—including Claude-3.7/4, Gemini-2.5-flash/pro, DeepSeek-V3.1, Qwen3-235B-A22B, Kimi-k2, o4-mini, gpt-oss-120B, and GPT5-mini—using multiple prompt strategies (direct JSON, chain-of-thought breakdown). Task context is varied between full tool sets (unmanageably large), domain-filtered (~100–250 tools), and retrieval-augmented (top-20 by embedding similarity via all-MiniLM, bge-large, or ToolBench-IR embeddings), with all evaluations conducted zero-shot. ReAct agentic inference is also benchmarked in static and dynamic retrieval modes (He et al., 6 Oct 2025).
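For the retrieval-augmented setting, a plausible top-k retriever over tool documentation can be sketched as follows; the concrete checkpoint name `all-MiniLM-L6-v2` and the cosine-similarity scoring are assumptions consistent with the embedding retrievers named above, not the paper's exact configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Hypothetical tool catalog: tool name -> fused documentation string.
tool_docs = {
    "search_flights": "Find flights between two airports on a given date.",
    "get_weather_forecast": "Daily weather forecast for a city.",
    # ... remaining tools from the 1,228-API catalog
}

# all-MiniLM-L6-v2 is assumed here as the concrete checkpoint behind the
# paper's "all-MiniLM" retriever.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
names = list(tool_docs)
doc_vecs = encoder.encode([tool_docs[n] for n in names], normalize_embeddings=True)

def retrieve_tools(query, k=20):
    """Return the top-k tools by cosine similarity between query and tool docs."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q   # cosine similarity, since embeddings are normalized
    top = np.argsort(-scores)[:k]
    return [names[i] for i in top]
```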
6. Results, Scaling Behavior, and Failure Modes
Performance Analysis
The benchmark shows that agentic tool use remains challenging even for frontier models. On parallel-domain queries, state-of-the-art LLMs achieve moderately high EM (e.g., Claude-4 EM=0.846 for simple queries vs. 0.445 for hard), but performance degrades sharply on hard queries and on sequential trajectories. Argument usage and final-answer accuracy suffer similarly, and EM on sequential tasks is 3–5% lower across models (see the selected results table below).
Selected Results Table (Parallel, Domain Tools)
| Model | EM (Simple) | Usage (Simple) | Acc (Simple) | EM (Hard) | Usage (Hard) | Acc (Hard) |
|---|---|---|---|---|---|---|
| Claude-4 | 0.846 | 0.839 | 0.905 | 0.445 | 0.794 | 0.517 |
| Gemini-2.5-pro | 0.851 | 0.835 | 0.911 | 0.442 | 0.785 | 0.498 |
Key Observations
- Query Difficulty: Hard queries (intent-obscured) induce ~0.40 drop in EM and similar decrements in usage and accuracy.
- Trajectory Length: All models show the steepest accuracy decline in the transition from 3 to 5 calls; smaller models fail outright on longer chains.
- Retrieval: Embedding-based IR is marginally effective on simple queries, but EM and Acc collapse severely on hard queries due to semantic-similarity bottlenecks.
- Agentic Inference: ReAct with dynamic retrieval at each reasoning/action step yields the largest gains, especially on complex tasks (Claude-4 EM from 0.445 to 0.472); see the sketch after this list.
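The following schematic sketches ReAct-style inference with per-step (dynamic) retrieval under assumed interfaces; `llm_step`, `retrieve_tools`, and `execute_call` are placeholders for the model call, the embedding retriever, and the live API executor.

```python
def react_with_dynamic_retrieval(query, llm_step, retrieve_tools, execute_call,
                                 max_steps=10):
    """ReAct-style loop that re-retrieves candidate tools before every action.

    llm_step, retrieve_tools, and execute_call are placeholder callables for
    the model, the embedding retriever, and the live API executor.
    """
    scratchpad = []  # (thought, action, observation) history
    for _ in range(max_steps):
        # Dynamic retrieval: refresh the candidate tool set at each step,
        # conditioned on the query plus everything observed so far.
        context = query + " " + " ".join(str(obs) for _, _, obs in scratchpad)
        candidate_tools = retrieve_tools(context, k=20)

        thought, action = llm_step(query, candidate_tools, scratchpad)
        if action is None:          # the model decides it can answer directly
            return thought, scratchpad

        observation = execute_call(action)
        scratchpad.append((thought, action, observation))
    return None, scratchpad
```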
Failure Modes
Manual inspection categorizes pervasive errors:
- Similar-Tool Confusion: Inaccurate substitution among semantically related APIs.
- Parameter-Blind Selection: Incorrect tool use driven by missing or mismatched required parameters.
- Redundant Calls: Over-fetching or irrelevant tool invocation.
- Hard Query Misinterpretation: Failure to map colloquial descriptions to underlying intent (e.g., missing implicit “review score” sort).
7. Insights, Limitations, and Future Directions
TRAJECT-Bench demonstrates that LLM agentic tool use is fundamentally brittle to trajectory length, parameterization complexity, and query ambiguity. The primary bottleneck is mid-length chains; pure semantic retrieval is inadequate for implicit intent; iterative agentic reasoning with dynamic context retrieval is markedly beneficial.
Suggested future avenues include:
- Extension to richer graph trajectories—branching, loops, and DAG-style plans.
- End-to-end trajectory-level supervised fine-tuning for more robust planning and error recovery.
- Development of multi-objective retrieval methods incorporating intent, schema, and semantic signals.
- Interactive, human-in-the-loop validation and correction.
- Expansion of domains, integration of authentic user logs, consideration of dynamic tool availability, and optimization for latency and cost (He et al., 6 Oct 2025).
By exposing trajectory-level signals and compositional diagnostics, TRAJECT-Bench provides rigorous foundations for systematic, scalable evaluation and improvement of agentic tool-use in LLM-driven environments.