UltraTool: LLM Tool Agent Benchmark
- UltraTool is a benchmark and technical framework that evaluates LLMs’ ability to plan, dynamically create, and invoke tools in real-world, compositional tasks.
- It decomposes tasks into planning, tool creation, and tool usage, using JSON-structured outputs to rigorously assess model performance across diverse domains.
- Results from 22 domains highlight challenges in multi-tool chaining and JSON compliance, driving research for more robust, autonomous LLM tool agents.
UltraTool is a large-scale benchmark and technical framework for evaluating and advancing large language models' (LLMs) capabilities in compositional, multi-step, real-world tool use. Unlike prior benchmarks or tool-use protocols constrained to fixed toolsets or narrowly defined steps, UltraTool targets the full lifecycle of tool utilization—including natural language (NL) planning, dynamic tool creation, and parameterized tool invocation—across complex, expert-derived task scenarios. The associated datasets and metrics drive principled evaluation of LLMs' proficiency in comprehensive tool-based problem solving, and illuminate the critical limitations and points of failure in current LLM architectures when operating in realistic environments (Huang et al., 2024).
1. Scope and Motivation
UltraTool addresses the gap in existing tool-use evaluation paradigms. Prior work (e.g., ToolAlpaca, APIBench, MetaTool) predominantly employs synthetic queries and a fixed, pre-defined API pool, neglecting challenges posed by real-world ambiguity, multi-tool integration, and on-the-fly tool definition. UltraTool instead:
- Embeds tasks from 22 high-variance domains (e.g., Flight, CRM, Finance, Weather, Medical), yielding 5,824 highly compositional queries.
- Requires models to infer and execute complete planning (“Planning”), dynamically design missing API interfaces (“Tool Creation”), and generate correct call sequences and argument lists (“Tool Usage”).
- Benchmarks true compositional reasoning, not mere single-call tool proficiency.
A core goal is to push LLMs beyond isolated function invocation, enabling evaluation of in-context reasoning, error recovery, and generalized tool synthesis, all essential for deployment as autonomous agents in open-world environments (Huang et al., 2024).
2. Benchmark Structure and Task Decomposition
UltraTool decomposes each scenario into three interdependent computational phases:
- Planning:
- Input: user query
- Output: a hierarchical NL plan P = (s_1, …, s_n), each step s_i an actionable sub-task.
- Requirements: (a) full decomposition and coverage; (b) logical, executable plan tree structure; (c) conciseness without underspecification.
- Example: For "Find and book the cheapest flight from Beijing to New York, then pick a hotel and map the subway route," plans might include "find flights," "book flight," "search hotels," "compute subway route."
- Tool Creation:
- Awareness: For each plan step s_i, predict an indicator c_i ∈ {0, 1}: is the step covered by the existing toolset T?
- Creation: For steps with c_i = 0, synthesize a JSON tool skeleton (requiring correct naming, description, parameter and return schema).
- Tool Usage:
- Awareness: Per step s_i, predict whether the step requires tool usage (indicator u_i ∈ {0, 1}).
- Selection: Match the sub-task to the correct tool in the augmented toolset T′ (which includes distractors).
- Usage (Argument Generation): Populate tool arguments in the correct JSON structure.
Each sub-task is evaluated individually, as well as in aggregate via multi-step and end-to-end success metrics. This decomposition directly targets the known bottlenecks in LLM multi-tool chaining and independent tool synthesis (Huang et al., 2024).
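The three phase outputs above can be sketched as JSON-serializable structures. The field names below are illustrative, not the benchmark's exact schema:

```python
import json

# Hypothetical plan output: an ordered list of actionable sub-tasks
plan = [
    {"id": "1", "step": "find flights from Beijing to New York"},
    {"id": "2", "step": "book the cheapest flight"},
    {"id": "3", "step": "search hotels near the destination"},
]

# Hypothetical tool skeleton created for a step not covered by the toolset:
# name, description, parameter schema, and return schema
tool_skeleton = {
    "name": "search_flights",
    "description": "Search flights between two cities on a given date",
    "arguments": {
        "origin": {"type": "string", "description": "departure city"},
        "destination": {"type": "string", "description": "arrival city"},
        "date": {"type": "string", "description": "YYYY-MM-DD"},
    },
    "results": {"flights": {"type": "array"}},
}

# Hypothetical tool call with populated arguments for plan step "1"
tool_call = {
    "step_id": "1",
    "tool": "search_flights",
    "arguments": {"origin": "Beijing", "destination": "New York"},
}

# Every phase output must round-trip as strict JSON for automatic evaluation
for obj in (plan, tool_skeleton, tool_call):
    assert json.loads(json.dumps(obj)) == obj
```

The strict-JSON round-trip at the end mirrors why format compliance is scored: an output that cannot be parsed cannot be evaluated or executed.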
3. Dataset Characteristics and Real-world Complexity
UltraTool’s dataset is distinguished by coverage, granularity, and open-world alignment:
- Statistics:
- 5,824 example queries spanning 22 application domains.
- Average query: 12.3 plan steps, 2.7 tool calls, and 3.05 arguments per call.
- 2,032 distinct tool skeletons; not limited to a fixed set, enabling dynamic tool construction.
- Bilingual: Native Chinese (original), English (via GPT-4 plus human review).
- Structure: Tasks encode tree-structured, interleaving NL logic and API interactions, including conditional/nested tool usage and out-of-toolset task decomposition. This captures practical tool agent requirements (entity linking, cross-step argument propagation).
- Format: Inputs and outputs are JSON-serializable, enforcing strict structure for automatic evaluation and facilitating model integration (Huang et al., 2024).
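Because every record is JSON-serializable, loading and inspecting the dataset is a one-liner per line. The record fields below are assumptions for illustration, not the repository's exact keys:

```python
import json

# One UltraTool-style record serialized as a JSON line (illustrative fields)
line = json.dumps({
    "domain": "Flight",
    "language": "en",
    "query": "Find and book the cheapest flight from Beijing to New York",
    "plan": ["find flights", "book flight"],
    "tools": [{"name": "search_flights"}],
})

record = json.loads(line)
assert record["domain"] == "Flight"
assert len(record["plan"]) == 2
```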
4. Evaluation Methodology and Metrics
UltraTool employs a multifaceted evaluation framework:
- Multi-Dimensional LLM-as-Judge: A reference LLM (typically GPT-4) rates outputs on axes including Accuracy, Completeness, Executability, and Format Compliance, with scores normalized to percentages.
- Key-Value Accuracy: For phases requiring structured prediction, per-step correctness is computed as the fraction of reference key-value pairs that the prediction reproduces exactly.
- Levenshtein Normalized Score (LevNorm): For string-valued arguments, LevNorm(a, b) = 1 − Lev(a, b) / max(|a|, |b|), where Lev(a, b) is the Levenshtein distance between predicted and reference strings.
- Aggregate Tool Utilization Score: a weighted combination of the per-phase scores (awareness, selection, argument generation), with weights summing to 1, typically reported alongside per-phase results.
- Planning Recall: Multi-step planning success is quantified as the proportion of reference plan steps recovered in the model's generated plan.
These metrics enable granular analysis of failure points—clarifying, for instance, that JSON format-compliance and accurate argument synthesis are limiting factors for smaller models (Huang et al., 2024).
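The string-level metrics above are straightforward to implement. This is a minimal sketch using the standard normalized Levenshtein similarity and exact key-value matching; the paper's exact definitions may differ in details such as tie-breaking or weighting:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lev_norm(pred: str, ref: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 is an exact match."""
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

def key_value_accuracy(pred: dict, ref: dict) -> float:
    """Fraction of reference key-value pairs reproduced exactly."""
    if not ref:
        return 1.0
    return sum(pred.get(k) == v for k, v in ref.items()) / len(ref)

assert levenshtein("kitten", "sitting") == 3
assert abs(lev_norm("kitten", "sitting") - (1 - 3 / 7)) < 1e-9
assert key_value_accuracy({"origin": "Beijing"},
                          {"origin": "Beijing", "date": "2024-05-01"}) == 0.5
```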
5. Baseline Performance and Model Insights
UltraTool benchmarks a spectrum of LLMs, including:
- Closed-source: GPT-3.5-turbo-1106, GPT-4-1106-preview.
- Open-source: Qwen (7B/14B/72B), LLaMA2 (7B/13B/70B), Mistral-7B, Baichuan2 (7B/13B), Vicuna (7B/13B), ChatGLM3-6B.
Quantitative results (average of six dimensions, percent):
| Model | Chinese dataset | English dataset |
|---|---|---|
| GPT-4 | 76.04 | 74.58 |
| GPT-3.5 | 59.68 | 58.90 |
| Qwen-72B | 64.12 | 62.94 |
| Mistral-7B | 55.05 | 54.76 |
| LLaMA2-70B | 49.17 | 51.90 |
| Baichuan2-13B | 46.86 | 42.08 |
Key findings:
- Large closed-source LLMs (e.g., GPT-4) outperform best open models by ~12 percentage points.
- Small open-source models (7B/13B) often succeed in syntactic planning but struggle with tool skeleton creation and strict JSON output, leading to unusable or incomplete results.
- JSON-compliance is the critical bottleneck for practical tool agent deployment; successful multi-tool chaining and argument propagation remain ongoing challenges.
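The JSON-compliance bottleneck can be made concrete with a strict parsing gate, as a harness around model output might apply. This is a minimal sketch, not the benchmark's own scoring code:

```python
import json

def parse_strict_json(raw: str):
    """Return the parsed object if raw is valid JSON, else None.

    Outputs that fail strict parsing are unusable for downstream
    tool invocation, regardless of how good their content is.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

# A bare JSON object parses; the same object wrapped in conversational
# prose (a common failure mode for smaller models) does not.
assert parse_strict_json('{"tool": "search_flights"}') is not None
assert parse_strict_json('Sure! Here is the call: {"tool": "search_flights"}') is None
```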
6. Key Advances and Future Research Directions
UltraTool's formulation and release introduce several advances to the field:
- Integrated, End-to-End Workflow Evaluation: UltraTool is the first public benchmark to jointly assess NL planning, dynamic tool (API) creation, and execution-relevant tool usage in a single, cohesive evaluation pipeline.
- Real-World Compositionality: The use of expert-crafted, non-synthetic queries and the allowance for dynamic toolset modification better reflect real agent demands and expose system weaknesses masked by synthetic benchmarks.
- Diagnostic, Multi-Faceted Metrics: Multi-dimensional scoring (via LLM-as-Judge, key-value accuracy, LevNorm) enables precise attribution of failure modes, guiding targeted model improvements—e.g., format-aware decoding.
- Catalyst for Robust LLM Tool Agents: By furnishing a ready-made testbed for end-to-end tool agent experiments, UltraTool is positioned to drive research in areas including:
- Closed-loop simulation with executable “mock” tools for validating generated arguments.
- Fine-tuning on nested or chained call workflows for improved internal consistency.
- Novel decoding/constraint mechanisms to improve structure adherence in output.
A plausible implication is that future LLM-assisted agents will require dedicated architectural or training innovations beyond general scaling to meet the compositional, format-compliance, and dynamic creation challenges illuminated by UltraTool (Huang et al., 2024).
7. Implementation, Access, and Practical Usage
UltraTool is publicly available at https://github.com/JoeYing1019/UltraTool and supports straightforward experimental integration:
- Repository Layout: Includes data (JSON-formatted, per-language), tool skeleton definitions, metrics/evaluation scripts, and few-shot phase-specific prompt examples.
- Model Integration Workflow:
- Clone repository, install dependencies.
- Implement a model wrapper that, for each UltraTool phase (planning, creation, usage, etc.), returns the required output in text or JSON.
- Invoke the provided evaluation scripts (e.g., `evaluate.py`) to score model outputs across dimensions (accuracy, LevNorm, LLM-judge scores).
- Modularity: Supports plug-and-play evaluation for arbitrary model architectures and scales to both closed-source API use and fully open-source model pipelines.
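A wrapper along the lines described above can be sketched as follows. The class and method names are illustrative; the repository defines its own interface:

```python
import json

class ModelWrapper:
    """Hypothetical plug-and-play wrapper: one generate() callable,
    one entry point per UltraTool phase, JSON out."""

    def __init__(self, generate):
        self.generate = generate  # callable: prompt string -> raw model text

    def run_phase(self, phase: str, example: dict) -> dict:
        prompt = f"[{phase}]\n{json.dumps(example, ensure_ascii=False)}"
        raw = self.generate(prompt)
        return json.loads(raw)  # evaluation scripts expect parseable JSON

# Stub "model" that always emits one plan step, useful for wiring tests
wrapper = ModelWrapper(lambda prompt: '{"plan": ["find flights"]}')
out = wrapper.run_phase("planning", {"query": "Book the cheapest flight"})
assert out == {"plan": ["find flights"]}
```

Swapping in a real model means replacing the lambda with a call to a local checkpoint or a hosted API; the phase loop and scoring remain unchanged.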
UltraTool thus establishes a replicable, fine-grained benchmark for the rigorous study of LLMs as tool agents in real-world contexts, forming the basis for systematic improvements in multi-step autonomous problem-solving (Huang et al., 2024).