Creative Tool Use Benchmark for AI
- Creative Tool Use Benchmark is a framework that rigorously measures AI's capacity to plan, select, operate, and invent tools in real-world scenarios.
- It integrates multi-domain datasets and multi-stage task evaluations, employing metrics like execution success rate and tool-F1 to assess performance.
- The benchmark emphasizes process supervision, iterative planning, and creative tool generation, highlighting significant gaps between human and model capabilities.
Creative Tool Use Benchmark
The Creative Tool Use Benchmark comprises a suite of datasets and protocols designed to rigorously assess artificial intelligence systems—especially LLMs and multimodal LLMs (MLLMs)—for their capacity to plan, select, operate, understand, and invent tools in complex, realistic scenarios. Creative tool use is operationally defined as the ability to not only utilize existing tools, but also to devise novel composite or repurposed implements when standard solutions are unavailable, traversing the full cycle from abstract planning to grounded tool creation and use. This benchmarking paradigm encompasses a diverse collection of task formats, tools, evaluation metrics, domains, and supervision structures. It reflects recent progress and persistent challenges in measuring genuine tool-centric reasoning and creativity, drawing from contemporary benchmarks including PhysToolBench (Zhang et al., 10 Oct 2025), m&m's (Ma et al., 2024), GTA (Wang et al., 2024), CANVAS (Jeong et al., 25 Nov 2025), UltraTool (Huang et al., 2024), and ToolComp (Nath et al., 2 Jan 2025).
1. Core Principles and Definitions
Creative tool use benchmarks are motivated by the recognition that human intelligence fundamentally relies on tool exploitation, adaptation, and invention, a faculty currently unquantified in most AI models. Unlike conventional single-step tool invocation or constrained tool selection, creativity benchmarks probe for the capacity to:
- Recognize and match tools to tasks based on functional mapping.
- Reason about physical properties, affordances, tool states, and combination requirements, including malfunction and composition.
- Invent ad hoc implements or new tool specifications, repurposing or synthesizing when standard tools are missing.
- Plan multi-step workflows for solving real-world, often open-ended goals.
Benchmarks such as PhysToolBench and UltraTool formalize “creative tool use” by explicitly partitioning evaluation into three phases: planning (decomposition of complex tasks), creation (design/specification of absent tools), and usage (parameterization and orchestration of tool calls) (Zhang et al., 10 Oct 2025, Huang et al., 2024). GTA, m&m’s, and ToolComp emphasize execution over real tools, process supervision, and multimodal grounding.
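As an illustration of this three-phase structure, the minimal Python sketch below encodes a single task record with planning, creation, and usage fields. The class and field names, and the WeatherForecast3Day example (echoing UltraTool's example task later in this article), are illustrative assumptions rather than any benchmark's released schema.

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    """Hypothetical specification for a tool the model must invent (creation phase)."""
    name: str
    description: str
    arguments: dict[str, str]  # argument name -> type/description

@dataclass
class CreativeToolUseTask:
    """Illustrative three-phase task record: planning, creation, usage."""
    query: str                     # open-ended user goal
    plan: list[str]                # decomposed sub-steps (planning phase)
    missing_tools: list[ToolSpec]  # tools the model must specify (creation phase)
    tool_calls: list[dict]         # parameterized invocations (usage phase)

task = CreativeToolUseTask(
    query="Plan a 3-day trip and report the weather for each day.",
    plan=["Book flights", "Reserve hotel", "Fetch 3-day weather forecast"],
    missing_tools=[ToolSpec(
        name="WeatherForecast3Day",
        description="Returns a 3-day forecast for a city.",
        arguments={"city": "string", "start_date": "ISO-8601 date"},
    )],
    tool_calls=[{"tool": "WeatherForecast3Day",
                 "args": {"city": "Kyoto", "start_date": "2025-04-01"}}],
)
```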
2. Dataset Design and Task Taxonomy
Benchmarks are constructed via human-authored, multi-domain queries, real versus synthetic tool libraries, and multi-stage workflows. Representative designs include:
| Benchmark | Task Types | Tool Inventory | Domains |
|---|---|---|---|
| PhysToolBench | Recognition, Understanding, Creation (Easy–Hard) | Physical objects (visual) | Daily life, industrial, outdoor, professional |
| UltraTool | Planning, Creation, Usage | Open-ended skeletons (specification only) | 22 domains (travel, finance, etc.) |
| m&m's | Multi-step multi-modal plans | 33 actual tools (ML models, image/audio APIs) | Image, text, audio |
| GTA | Step-implicit multimodal, creativity | 14 live APIs in 4 categories | Real-world multimodal contexts |
| CANVAS | UI Replication, Modification | 50 Figma plugin tools | Mobile UI design |
| ToolComp | Multi-step reasoning, process supervision | 2–11 API/compute tools | Arithmetic, retrieval, programming, finance |
PhysToolBench features 1,000 VQA-style instances, partitioned into three progressive tiers: Easy (Tool Recognition), Medium (Attribute, Combination, and Availability probes), and Hard (Tool Creation) (Zhang et al., 10 Oct 2025). UltraTool's 5,824 triplets include open-ended multi-level plans, tool specifications, and tool usage messages (Huang et al., 2024). GTA and m&m's use executable tool chains (with feedback or live APIs) over multimodal inputs. CANVAS narrows its focus to software UI design, requiring stepwise tool invocation for complex state edits. ToolComp emphasizes process supervision: annotators label every intermediate step for correctness.
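To make the tiered VQA design concrete, the sketch below shows one hypothetical PhysToolBench-style item per tier. The field names and question wording are assumptions for exposition, not the published data format; the Hard-tier example mirrors the coin-as-screwdriver scenario cited later in this article.

```python
# Illustrative (not official) record layout for PhysToolBench-style VQA items.
phystoolbench_style_items = [
    {"tier": "Easy",   "image": "scenes/workbench_01.jpg",
     "question": "Which object in the image is a screwdriver?"},
    {"tier": "Medium", "image": "scenes/workbench_02.jpg",
     "question": "Which tool remains usable given that one has a broken handle?"},
    {"tier": "Hard",   "image": "scenes/kitchen_042.jpg",
     "question": "No screwdriver is present; which object could turn a flat-head screw?",
     "expected_answer": "coin"},
]
```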
3. Evaluation Protocols and Metrics
Creative tool use benchmarks employ a variety of quantitative and qualitative metrics tailored to their philosophy and data structure:
- Task/Step Accuracy: Primary metric in PhysToolBench, UltraTool, GTA, CANVAS, ToolComp. Computed as strict label match per instance or step.
- Tool-F1, ArgName-F1 (m&m’s, GTA): F1 scores on tool/argument prediction versus ground-truth chains (a computation sketch follows this list).
- Execution Success Rate: Fraction of plans completing without error and producing correct outputs (Ma et al., 2024).
- Plan Correctness: Fraction of model-generated plans exactly matching gold plans at the tool+argument level.
- Chain Completion Rate, Category F1 (GTA): Per-category precision, recall, F1 for tool selection/invocation.
- LLM-as-Judge Point-wise Scoring (UltraTool): GPT-4 rates planning and tool-creation on multi-dimensional axes (accuracy, completeness, executability, format, efficiency, rationality).
- Levenshtein/Normalized Edit Distance: Used in UltraTool’s tool usage phase for argument string fidelity.
- Perceptual/Component Similarity (CANVAS): SSIM, saliency map correlation, BLIP caption/cosine similarity, text F1 between generated and ground-truth UI states (Jeong et al., 25 Nov 2025).
- Process Reasoning Accuracy (ToolComp): Correctness on intermediate steps in tool-use trajectories.
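The sketch below shows one plausible way to compute two of these metrics: a multiset F1 over predicted versus gold tool names, and a normalized Levenshtein distance for argument-string fidelity. Exact matching rules differ across benchmarks, so this is an approximation rather than any benchmark's official scorer.

```python
from collections import Counter

def tool_f1(predicted: list[str], gold: list[str]) -> float:
    """Multiset F1 over predicted vs. gold tool names (one common way to score
    tool selection; matching rules vary across benchmarks)."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer string length, approximating
    the argument-fidelity scoring used in UltraTool's tool usage phase."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))           # dp[j] = distance between "" and b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[n] / max(m, n, 1)

# Example: a predicted chain missing one tool, and a slightly wrong argument string.
print(tool_f1(["stylize", "caption"], ["stylize", "caption", "detect"]))  # 0.8
print(normalized_edit_distance("Kyoto, JP", "Kyoto, Japan"))
```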
In ToolComp, final-answer accuracy is paired with process-supervision metrics; PRMs (process-supervised reward models) outperform ORMs (outcome-supervised) by 19pp (base) and 11pp (SFT) in rank@1 for trajectory ranking.
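The difference between the two reward-model families can be made concrete with a schematic: a PRM aggregates per-step rewards over a trajectory, an ORM scores only the final answer, and rank@1 asks whether the top-scored candidate trajectory is correct. The functions below are a hedged sketch of that scoring interface, not ToolComp's implementation; the reward and correctness callables are placeholders.

```python
def prm_score(trajectory: list[dict], step_reward) -> float:
    """Process-supervised scoring: aggregate (here, average) per-step rewards."""
    rewards = [step_reward(step) for step in trajectory]
    return sum(rewards) / max(len(rewards), 1)

def orm_score(trajectory: list[dict], outcome_reward) -> float:
    """Outcome-supervised scoring: score only the final step / answer."""
    return outcome_reward(trajectory[-1])

def rank_at_1(candidates: list[list[dict]], score, is_correct) -> bool:
    """True if the highest-scoring candidate trajectory is a correct one."""
    best = max(candidates, key=score)
    return is_correct(best)

# Usage: rank_at_1(candidates, lambda t: prm_score(t, my_step_reward), is_correct),
# where my_step_reward and is_correct are benchmark-specific (hypothetical here).
```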
4. Experimental Findings and Model Performance
Empirical results across benchmarks consistently highlight significant gaps between human-level creative tool use and contemporary LLM/MLLM performance:
- In PhysToolBench, human accuracy is 87.9–93.2%. Top models (GPT-5, GPT-4o, o3) score only 60–63% overall; open-source MLLMs cluster at ~40–55%, with more specialized embodied/VLA models under 40% and struggling on hard tool creation tasks (<46%) (Zhang et al., 10 Oct 2025).
- UltraTool reports overall scores of 76.04% for GPT-4, 64.12% for Qwen-72B, and 8–31% for open-source 7B models (Huang et al., 2024); model scale correlates with performance.
- In m&m’s, GPT-4 with multi-step planning, JSON output, and full feedback (parsing, verification, execution) achieves up to 89% tool-F1 and a 98% pass rate. Human execution success is higher still, indicating unsolved challenges in LLM grounding and robustness (Ma et al., 2024).
- GTA’s real-world tasks challenge all systems: GPT-4 solves <50% of tasks, and most models solve <25%. F1_C for creativity-category tool selection with GPT-4 reaches 0.8955, but argument formatting (ArgAcc ≈ 35–38%) limits full pipeline success (Wang et al., 2024).
- CANVAS reports SSIM, saliency, and BLIP metrics for design replication/modification, with GPT-4.1, Gemini-2.5-Pro, and others leading at 0.76–0.89 SSIM. Error analysis reveals strategic but imperfect tool-call behavior (Jeong et al., 25 Nov 2025).
Common error modes encompass misidentification of visually similar objects, shallow commonsense reasoning, hallucinated affordances, poor spatial/size reasoning, argument formatting failures, and tool invocation syntax errors.
5. Annotation Strategies, Supervision, and Data Generation
Benchmarks leverage a variety of human and synthetic annotation protocols:
- ToolComp integrates detailed human process supervision at every intermediate step (ReAct format), annotating Thought, Action, Action Input, and Plan correctness (Nath et al., 2 Jan 2025); a schematic of one annotated step appears after this list.
- m&m's and UltraTool rely on multi-round human verification; UltraTool merges similar tool skeletons to form a gold toolset.
- Synthetic data pipelines, as in ToolComp, use policy and critic models to simulate trajectories, discarding ambiguous or error-prone runs.
- CANVAS appends conversation histories (Thoughts, Actions, Observations) for multi-turn context maintenance and stopping criteria.
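The step-level annotation described above can be pictured as a labeled ReAct record like the one below. The field names and example content are assumptions for illustration, not the released ToolComp annotation schema.

```python
# Schematic of ToolComp-style process supervision: each ReAct step carries
# human labels for its components. Fields are hypothetical.
annotated_step = {
    "thought": "I need the closing price of the stock on 2023-06-01 first.",
    "action": "python_interpreter",
    "action_input": "prices.loc['2023-06-01', 'close']",
    "plan": "1) fetch price  2) fetch weather  3) combine into final answer",
    "labels": {
        "thought_correct": True,
        "action_correct": True,
        "action_input_correct": False,   # e.g., wrong date format for the data source
        "plan_correct": True,
    },
}
```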
A plausible implication is that process-supervised reward modeling may be critical to efficiently scaling high-fidelity training across complex tool-use domains.
6. Recommendations, Limitations, and Future Directions
Benchmarks converge on several recommendations and open challenges:
- Strict API schemas and JSON formatting (as prompt templates) mitigate argument errors; a schema-validation sketch follows this list.
- Multi-step, single-shot planning (versus ReAct) improves execution success, and modular architectures (separating planning from invocation) may further improve reliability.
- Feedback mechanisms (parsing, verification, execution) are necessary but insufficient for bridging the gap to human performance.
- Extension to deeper, conditional, or branching tool chains; support for video/3D modalities; and real-world grounding remain open frontiers.
- Process supervision, human-in-the-loop curation, and imitation learning from annotated traces are essential for robust evaluation and training.
- PhysToolBench proposes integration of vision-centric reasoning (object detection and multi-level analysis) to improve model performance on hard tasks.
- UltraTool tool skeletons are currently non-executable; future work is needed on real-world deployment.
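One concrete realization of the schema recommendation is to publish a strict JSON schema per tool and validate model output before execution, as in the hedged sketch below. The overlay_text tool and the validation helper are hypothetical; the schema layout follows the widely used function-calling convention rather than any single benchmark's format.

```python
import json

# Hypothetical tool schema in the common function-calling style.
OVERLAY_TEXT_SCHEMA = {
    "name": "overlay_text",
    "description": "Draw a text label above a bounding box in an image.",
    "parameters": {
        "type": "object",
        "properties": {
            "image_path": {"type": "string"},
            "text": {"type": "string"},
            "box": {"type": "array", "items": {"type": "integer"},
                    "minItems": 4, "maxItems": 4},
        },
        "required": ["image_path", "text", "box"],
    },
}

def validate_call(raw_model_output: str, schema: dict) -> dict:
    """Parse the model's JSON tool call and reject missing or unknown arguments
    before execution (a simple form of the parsing/verification feedback above)."""
    call = json.loads(raw_model_output)
    props = schema["parameters"]["properties"]
    required = set(schema["parameters"]["required"])
    missing = required - call.keys()
    unknown = set(call) - props.keys()
    if missing or unknown:
        raise ValueError(f"missing={missing}, unknown={unknown}")
    return call
```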
Benchmarks such as ToolComp, CANVAS, and UltraTool collectively highlight that while significant progress has been achieved in modeling multi-step, creative tool use, substantial deficiencies persist in physical and semantic reasoning, argument fidelity, and the inventive capacities required for true general-purpose intelligence.
7. Representative Example Tasks
Creative tool use benchmarks provide a spectrum of illustrative scenarios:
- PhysToolBench Hard creation: “Use a coin to turn a flat-head screw when no standard screwdriver is available.”
- GTA creativity chain: “Stylize a photo to cartoon, generate a caption, detect the subject region, overlay the title above the detected subject” (see the plan sketch after this list).
- UltraTool planning/creation: “Book travel, reserve hotel, fetch custom weather data—requiring invented WeatherForecast3Day tool.”
- CANVAS modification: “Make a button slightly darker, increase its corner radius—requiring precise attribute update with tool JSON.”
- ToolComp process: “Multi-hop financial calculation using historical stock data via Python interpreter, weather API, and arithmetic reasoning with chain-of-thought and step-level verification.”
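The GTA-style creativity chain above can be written as an explicit plan whose arguments reference earlier step outputs, as in the hedged sketch below. The tool names and the "$step_k" reference syntax are illustrative assumptions, not GTA's actual API.

```python
# Schematic multi-step plan with cross-step argument references (hypothetical syntax).
plan = [
    {"id": 0, "tool": "image_stylization",
     "args": {"image": "photo.jpg", "style": "cartoon"}},
    {"id": 1, "tool": "image_captioning",
     "args": {"image": "$step_0.image"}},
    {"id": 2, "tool": "object_detection",
     "args": {"image": "$step_0.image", "query": "main subject"}},
    {"id": 3, "tool": "overlay_text",
     "args": {"image": "$step_0.image",
              "text": "$step_1.caption",
              "box": "$step_2.box"}},
]
```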
These tasks collectively encapsulate the underlying vision of creative tool use: not rote invocation, but adaptive, strategic, and inventive application of a diverse set of tools for open-ended, real-world goals.