Tool Creation Agent
- Tool creation agents are modular, LLM-based systems that autonomously generate, integrate, and refine tools to address evolving task requirements.
- They employ dynamic pipelines with code synthesis and iterative self-improvement to convert raw API documentation into executable, error-checked code.
- These systems leverage multi-agent coordination and hierarchical design to efficiently adapt to complex challenges across diverse computational domains.
A tool creation agent is an autonomous or semi-autonomous software system—often LLM-based—that is capable of generating, extending, integrating, or refining external tools or environments to support complex adaptive workflows. Unlike traditional agents that use only a predefined set of tools, tool creation agents actively construct new tools or dynamically extend their capabilities in response to task requirements, environmental feedback, or explicit capability gaps. Advanced instances feature modular design, code synthesis, iterative self-improvement, and integration of heterogeneous computational resources or external repositories.
1. Architectural Principles and Modular Design
Modern tool creation agents exhibit a shift from monolithic, inflexible architectures to modular, composable, and extensible systems. The Core Reinforcement Learning library (CoRL) exemplifies this through an architecture whose core environment class is decomposed into interchangeable Agents, Platforms, Simulators, and auxiliary modules such as Glues, Rewards, and Dones, all composed via a directed acyclic graph. Each module, or "functor," encapsulates a specific function such as observation transformation or reward calculation. This modularity permits fine-grained control; for example, the dynamics of the 1D Docking task can be specified directly as linear ODEs.
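A minimal Python sketch of this functor-style composition, assuming hypothetical class names (`LinearODEDynamics`, `DistanceGlue`, etc.) that illustrate the pattern rather than CoRL's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch of CoRL-style modular composition; the module names
# here are illustrative, not CoRL's real classes.

@dataclass
class State:
    position: float  # metres from the dock
    velocity: float  # m/s

class LinearODEDynamics:
    """1D double-integrator dynamics x' = v, v' = a, stepped with Euler."""
    def __init__(self, dt: float = 0.1):
        self.dt = dt

    def step(self, s: State, accel: float) -> State:
        return State(
            position=s.position + self.dt * s.velocity,
            velocity=s.velocity + self.dt * accel,
        )

class DistanceGlue:
    """Observation transform: expose only the distance to the dock."""
    def observe(self, s: State) -> float:
        return abs(s.position)

class DockingReward:
    """Dense reward: negative distance, so closer is better."""
    def __call__(self, s: State) -> float:
        return -abs(s.position)

class DockedDone:
    """Episode terminates when within tolerance at low speed."""
    def __call__(self, s: State, tol=0.1, v_tol=0.05) -> bool:
        return abs(s.position) < tol and abs(s.velocity) < v_tol

# Compose the interchangeable functors into one environment step.
dynamics, glue, reward, done = (LinearODEDynamics(), DistanceGlue(),
                                DockingReward(), DockedDone())
s = State(position=1.0, velocity=0.0)
s = dynamics.step(s, accel=-0.5)
print(glue.observe(s), reward(s), done(s))
```

Because each functor owns one concern, swapping the dynamics (e.g., for a different integrator) leaves the observation, reward, and termination modules untouched.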
In agent systems like GATE, this principle extends to the dynamic management of a hierarchical graph of tools, where nodes represent basic or composite tools and edges encode invocation dependencies. Layering and compositionality are used to evolve the functionality contained in the agent's repertoire over time (2502.14848).
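A small sketch of such a hierarchical tool graph, assuming a hypothetical `ToolGraph` data structure (illustrative, not GATE's actual model), where composite tools record invocation edges to the basic tools they layer on:

```python
# Hypothetical sketch of a GATE-style hierarchical tool graph; nodes are
# basic or composite tools, edges record invocation dependencies.

class ToolGraph:
    def __init__(self):
        self.tools = {}   # name -> callable
        self.deps = {}    # name -> tools this node invokes (graph edges)

    def add_basic(self, name, fn):
        self.tools[name] = fn
        self.deps[name] = []

    def add_composite(self, name, fn, deps):
        # A composite node may only depend on already-registered tools,
        # which keeps the dependency structure acyclic.
        assert all(d in self.tools for d in deps), "dependency must exist first"
        self.tools[name] = fn
        self.deps[name] = list(deps)

    def call(self, name, *args):
        return self.tools[name](*args)

g = ToolGraph()
g.add_basic("add", lambda a, b: a + b)
g.add_basic("square", lambda a: a * a)
# Composite tool layered over the two basics: computes (a + b)^2.
g.add_composite("sum_sq",
                lambda a, b: g.call("square", g.call("add", a, b)),
                deps=["add", "square"])
print(g.call("sum_sq", 2, 3))  # 25
```

Growing the repertoire then amounts to adding new composite nodes over existing ones, so capabilities compound over time.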
2. Tool Generation and Integration Pipelines
Recent frameworks move beyond tool selection to perform autonomous tool generation, often using LLMs to synthesize code from natural language specifications or raw documentation. Systems such as ToolFactory (2501.16945) and Doc2Agent (2506.19998) process REST API documentation—often unstructured and inconsistent—to extract endpoints, parameter schemas, and usage examples into a normalized JSON format. These intermediate representations are then used to generate Python functions or OpenAPI files compatible with agent frameworks, with automatic type inference and error checking.
A representative formula for the LLM extraction process (in ToolFactory) is:

$$ s = f_{\theta}(d, p) $$

where $d$ encodes the API doc, $s$ is the extracted schema, $p$ is a trainable instruction prompt, and $\theta$ denotes model parameters.
Iterative refinement plays a central role. Agents like Doc2Agent or ToolMaker (2502.11705) validate initial tool generation by executing API calls, inspecting errors, and applying automated code corrections or parameter inference, looping until tools pass prescribed test cases or human-in-the-loop evaluation. This process can be formalized as a state transition system $s_{t+1} = \delta(s_t, o_t)$, where $s_t$ is the current candidate tool and $o_t$ the execution feedback; the loop continues until all unit tests pass.
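The generate-execute-repair loop can be sketched as follows; `synthesize` here is a canned stand-in for an LLM call, used purely to make the control flow concrete:

```python
# Hypothetical sketch of the generate-execute-repair loop used by systems
# like Doc2Agent and ToolMaker; `synthesize` stands in for an LLM call.

def synthesize(spec, error):
    # Stand-in for an LLM: the first draft has a bug, the "repair"
    # prompted by the error message fixes it.
    if error is None:
        return "def tool(x):\n    return x + '1'"   # buggy draft
    return "def tool(x):\n    return x + 1"          # corrected draft

def refine_until_valid(spec, tests, max_iters=5):
    error = None
    for _ in range(max_iters):
        code = synthesize(spec, error)
        ns = {}
        exec(code, ns)                 # materialize the candidate tool
        try:
            for case, expected in tests:
                assert ns["tool"](case) == expected
            return ns["tool"]          # all unit tests pass: done
        except Exception as e:
            error = repr(e)            # feed the failure back to the LLM
    raise RuntimeError("tool never passed its tests")

tool = refine_until_valid("increment an integer", tests=[(1, 2), (41, 42)])
print(tool(41))  # 42
```

Each iteration corresponds to one state transition: the candidate code plus the latest execution feedback determines the next candidate.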
3. Dynamic Tool Selection and Autonomy
With rapidly expanding tool ecosystems, efficiency and adaptability in tool discovery and invocation become central. The Tulip Agent (2407.21778) demonstrates semantic and recursive search within a vector store of tool embeddings, supporting CRUD operations (Create, Read, Update, Delete) and leveraging LLM-generated code to extend the tool library as needed.
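A compact sketch of semantic tool search over a vector store with CRUD operations; the bag-of-words embedding below is a stand-in for a real embedding model, and the `ToolStore` interface is illustrative rather than the Tulip Agent's actual API:

```python
import math
from collections import Counter

# Hypothetical sketch of Tulip-Agent-style semantic tool retrieval over a
# vector store supporting Create, Read, Update, Delete.

def embed(text):
    # Bag-of-words stand-in for a learned embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToolStore:
    def __init__(self):
        self.store = {}  # name -> (description embedding, callable)

    def create(self, name, description, fn):
        self.store[name] = (embed(description), fn)

    def read(self, query, top_k=1):
        # Semantic search: rank stored tools by similarity to the query.
        ranked = sorted(self.store.items(),
                        key=lambda kv: cosine(embed(query), kv[1][0]),
                        reverse=True)
        return [name for name, _ in ranked[:top_k]]

    def update(self, name, description, fn):
        self.create(name, description, fn)

    def delete(self, name):
        self.store.pop(name, None)

ts = ToolStore()
ts.create("convert_units", "convert metres to feet", lambda m: m * 3.28084)
ts.create("send_mail", "send an email message", lambda to: f"sent to {to}")
print(ts.read("turn metres into feet"))  # ['convert_units']
```

Only the top-ranked tool descriptions ever enter the LLM context, which is what keeps token cost flat as the library grows; the `create`/`update` path is also where LLM-generated code extends the library.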
This approach is contrasted with traditional methods, which either inject all tool descriptions into the LLM context (incurring prohibitive token costs) or restrict agents to a small, static toolkit. Instead, systems like MCP-Zero (2506.01056) implement "active tool request," enabling agents to proactively output structured requests in the form
```
<tool_assistant>
  server: [domain]
  tool: [operation]
</tool_assistant>
```
This actively narrows the search space and reduces prompt size, yielding substantial gains in both scalability (e.g., handling 3,000+ tools) and resource efficiency (up to 98% token reduction).
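Routing such a request can be sketched as a parse-and-dispatch step; the regex mirrors the block format above, while the registry and tool names are purely illustrative:

```python
import re

# Hypothetical sketch of routing an MCP-Zero-style active tool request;
# the registry contents are illustrative, not a real tool catalogue.

REQUEST_RE = re.compile(
    r"<tool_assistant>\s*server:\s*(?P<server>\S+)"
    r"\s*tool:\s*(?P<tool>\S+)\s*</tool_assistant>"
)

REGISTRY = {("weather", "get_forecast"): lambda city: f"forecast for {city}"}

def route(model_output):
    m = REQUEST_RE.search(model_output)
    if m is None:
        return None  # no structured request: treat output as plain text
    # The (server, tool) pair narrows the search to one registry entry,
    # so no tool descriptions need to be injected into the prompt.
    return REGISTRY.get((m.group("server"), m.group("tool")))

out = "<tool_assistant>\nserver: weather\ntool: get_forecast\n</tool_assistant>"
fn = route(out)
print(fn("Paris"))  # forecast for Paris
```

The key point is that the model names what it needs first, and only the matching entry is resolved, rather than the runtime describing thousands of tools up front.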
4. Iterative Self-Improvement and Closed-Loop Control
Tool creation agents frequently integrate closed-loop feedback and self-correction. For example, ATLASS (2503.10071) structures the process into (i) requirement understanding, (ii) tool retrieval or dynamic generation (including environment setup and API doc retrieval), and (iii) task-solving by compositional execution. Each generated tool is stored in a Tool Dataset for future reuse, further minimizing redundancy.
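The three-stage loop with a persistent tool dataset can be sketched as below; the canned code table stands in for LLM generation, and all names are illustrative rather than ATLASS's actual interfaces:

```python
import math

# Hypothetical sketch of an ATLASS-style closed loop: (i) understand the
# requirement, (ii) retrieve a cached tool or generate one, (iii) solve
# the task. CANNED stands in for LLM code generation.

TOOL_DATASET = {}  # persisted tools, keyed by capability, enabling reuse

CANNED = {
    "area_of_circle": "import math\ndef tool(r):\n    return math.pi * r * r"
}

def understand(task):
    # (i) map the natural-language task to a named capability
    return "area_of_circle" if "circle" in task else "unknown"

def retrieve_or_generate(capability):
    # (ii) reuse a stored tool if one exists, else generate and store it
    if capability not in TOOL_DATASET:
        ns = {}
        exec(CANNED[capability], ns)
        TOOL_DATASET[capability] = ns["tool"]
    return TOOL_DATASET[capability]

def solve(task, arg):
    # (iii) task-solving by executing the (possibly newly created) tool
    return retrieve_or_generate(understand(task))(arg)

print(round(solve("area of a circle of radius 2", 2), 2))  # 12.57
```

On a second matching task the tool is served from `TOOL_DATASET` without regeneration, which is where the redundancy savings come from.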
In frameworks such as Deep Agent (2502.07056), autonomous API & Tool Creation (AATC) modules extract patterns from UI interactions or simulation, generate atomic or composite APIs, and optimize them through iterative execution and correction, thereby compressing future inference cost for similar tasks.
Learning from experience is integrated at multiple layers: agents like GitAgent (2312.17294) abstract and generalize from human fixes in GitHub Issues or Pull Requests to resolve ambiguous or incomplete documentation. The process is formalized via composite trajectories of action–observation pairs,

$$ \tau = \big((a_1, o_1), (a_2, o_2), \ldots, (a_T, o_T)\big), $$

with successive steps for candidate evaluation and experience summarization.
5. Multi-Agent and Hierarchical Coordination
Modern systems increasingly employ multi-agent and hierarchical architectures for tool creation and orchestration. ConAgents (2403.03031) divides responsibilities among specialized agents—grounding, execution, and observing—communicating via calibration protocols. In hierarchical MAS settings like HASHIRU (2506.04255), a CEO agent coordinates the instantiation of specialist employees, each using or extending the tool environment, and economic models balance the computational and monetary costs of tool creation and invocation.
This enables dynamic self-extension and robust adaptation to evolving resource constraints.
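One way to make such a cost trade-off concrete is a break-even comparison between paying per-invocation for an existing tool and paying once to create a new one; the cost figures and option names below are illustrative assumptions, not HASHIRU's actual model:

```python
from dataclasses import dataclass

# Hypothetical sketch of an economic tool-creation decision: weigh the
# one-off cost of building a tool against repeated invocation costs.

@dataclass
class Option:
    name: str
    creation_cost: float   # one-off compute/engineering cost
    invoke_cost: float     # per-call compute + monetary (e.g. API) cost

def total_cost(opt, expected_calls):
    return opt.creation_cost + expected_calls * opt.invoke_cost

def choose(options, expected_calls):
    # Pick the cheapest option for the anticipated usage volume.
    return min(options, key=lambda o: total_cost(o, expected_calls))

existing_api = Option("paid_api", creation_cost=0.0, invoke_cost=0.02)
new_tool = Option("self_built", creation_cost=5.0, invoke_cost=0.001)

# Break-even behaviour: building only pays off at high expected usage.
print(choose([existing_api, new_tool], expected_calls=10).name)    # paid_api
print(choose([existing_api, new_tool], expected_calls=1000).name)  # self_built
```

Under this model a coordinating agent creates a tool only when anticipated reuse amortizes its construction cost, which is one way resource constraints shape self-extension.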
6. Benchmarking, Evaluation, and Practical Applications
Evaluation frameworks assess tool creation agents on metrics such as correctness (unit test and benchmark pass rates), efficiency (cost reduction, resource footprints), scalability (number of tools handled), and adaptability (domain transfer, composite toolchains). On established benchmarks, recent agents demonstrate substantial improvements—e.g., ToolMaker achieves 80% task implementation correctness (vs. 20% for a strong baseline) (2502.11705); Doc2Agent delivers a 55% performance increase and a 90% cost reduction on WebArena (2506.19998).
Applications span virtual agent environments (AgentStudio (2403.17918)), scientific and domain-specific integration (glycomaterials research (2501.16945, 2506.19998)), autonomous robotics, and interactive design generation (MAxPrototyper (2405.07131)). Instances exist where tool creation agents dynamically generate, validate, and deploy scripts for new computational physics tasks, integrate research code as LLM-invocable components, or rapidly compose new UI workflows from user specifications.
7. Open Challenges and Future Research Directions
Despite progress, tool creation agents face ongoing challenges, including:
- Robustness to incomplete or inconsistent documentation; reliance on error-prone parameter inference.
- Generalization of compositional reasoning over large tool graphs, especially in open-ended or multimodal environments.
- Safe execution and ethical filtering when synthesizing and running user- or model-generated code in arbitrary domains.
- Efficient benchmarking and evaluation in real-world, complex, and dynamic settings.
Continued work targets tighter feedback integration (e.g., agent debate and reflexion (2503.23781)), active capability acquisition (iterative toolchain construction (2506.01056)), and scalable self-improvement (self-generated Code-as-Task benchmarks (2506.01716), automated verifiable agentic tasks (2506.10055)) as critical levers for the advancement of autonomous tool creation capabilities.
A tool creation agent thus encompasses the integration of modular design, autonomous code and tool synthesis, dynamic environment interaction, and iterative self-improvement—enabling LLM-driven systems to adapt, extend, and refine their toolsets for a continually expanding array of complex tasks across diverse domains.