Autonomous Tool Creation in AI
- Autonomous tool creation is a paradigm in AI where agents autonomously generate and refine tools, integrating code synthesis and physical design to address novel tasks.
- Techniques include closed-loop LLM frameworks, modular hierarchical pipelines, and token-based integration to ensure efficient and reusable tool generation.
- This approach transforms computational research and robotics by reducing engineering overhead and enabling self-improving autonomous systems.
Autonomous tool creation is a paradigm in artificial intelligence where agents—primarily large language models (LLMs) and vision-language models (VLMs)—autonomously generate, refine, and deploy reusable tools to solve complex and novel tasks. This capability extends beyond traditional tool use, which confines AI systems to human-built, static toolsets, enabling them to synthesize code, action plans, or even physical objects “on demand.” Modern frameworks implement autonomous tool creation at multiple system levels: in software agents invoking domain-specific functions, in robotic systems inventing tangible implements, or via reference-guided code synthesis for new scientific domains. This article surveys the methodologies, architectures, evaluation results, and implications of contemporary approaches to autonomous tool creation.
1. Conceptual Foundations
The notion of autonomous tool creation marks a shift from mere tool usage and discovery to innovation—agents are not only capable of selecting from existing tools but can also invent novel tools by synthesizing or composing new functionalities.
Active inference models, as in "Understanding Tool Discovery and Tool Innovation Using Active Inference" (Collis et al., 2023), formalize this progression. Here, an agent equipped with a probabilistic generative model factorizes hidden states into tool affordances (such as horizontal or vertical reach), enabling offline induction of composite tools from primitive components. The expected free energy of a policy $\pi$,

$$G(\pi) = \underbrace{-\,\mathbb{E}_{Q(o_\tau, s_\tau \mid \pi)}\!\left[\ln Q(s_\tau \mid o_\tau, \pi) - \ln Q(s_\tau \mid \pi)\right]}_{\text{epistemic value (information gain)}} \;-\; \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\!\left[\ln P(o_\tau)\right]}_{\text{pragmatic value (utility)}},$$

illustrates how agents jointly minimize uncertainty and optimize utility against prior preferences, facilitating tool innovation through adaptive model expansion. This conceptual distinction is foundational: tool use, discovery, and innovation are mapped onto different factorizations and policy optimizations within the agent’s internal generative models.
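To make the decomposition concrete, the following minimal sketch evaluates $G(\pi)$ for a discrete two-state, two-observation model; the likelihood matrix, state belief, and preference prior are illustrative values chosen here, not numbers from the paper.

```python
import numpy as np

def expected_free_energy(q_s, A, log_prior_o):
    """Expected free energy for one time step of a discrete generative model.

    q_s:          predictive state distribution Q(s | pi), shape (S,)
    A:            likelihood matrix P(o | s), shape (O, S)
    log_prior_o:  log preferences over observations ln P(o), shape (O,)
    """
    q_o = A @ q_s                                        # predictive observations Q(o | pi)
    # Epistemic term: expected KL[Q(s | o, pi) || Q(s | pi)] under Q(o | pi)
    post = (A * q_s) / np.maximum(q_o[:, None], 1e-12)   # Q(s | o, pi) via Bayes' rule
    info_gain = np.sum(q_o * np.sum(post * np.log(np.maximum(post / q_s, 1e-12)), axis=1))
    # Pragmatic term: expected log preference E_Q(o)[ln P(o)]
    utility = q_o @ log_prior_o
    return -info_gain - utility                          # G(pi); policies minimize this

# Illustrative example: two hidden states, two observations
A = np.array([[0.9, 0.2],     # P(o=0 | s)
              [0.1, 0.8]])    # P(o=1 | s)
q_s = np.array([0.5, 0.5])              # maximally uncertain state belief
log_prior_o = np.log([0.7, 0.3])        # agent prefers observation 0
print(expected_free_energy(q_s, A, log_prior_o))
```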
2. Software-Based Tool Creation Frameworks
A diverse set of frameworks address autonomous tool creation in the context of computational agents.
Closed-Loop LLM Frameworks
The LATM (LLMs As Tool Makers) architecture (Cai et al., 2023) employs a closed-loop pipeline in which a “tool maker” LLM synthesizes reusable Python tools via programming-by-example and a “tool user” LLM then maps natural language queries onto invocations of those tools. The workflow is:
- Tool generation (via few-shot prompts, error correction, and validation with in-context unit tests).
- Wrapping (curating function code and exemplar mappings into call-ready APIs).
- Decoupled usage, where lightweight LLMs apply cached tools, reducing inference cost and maintaining performance parity with more powerful models.
A functional cache stores the semantics of tool APIs rather than LLM-generated natural language outputs, enabling broad amortization of tool creation cost: computationally intensive tool generation is a one-time cost per class of request, while usage is reduced to lightweight LLM inference.
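A minimal sketch of this maker/user split appears below; `call_maker_llm` is a hypothetical stand-in for the strong tool-maker model, and the cache keys, validation scheme, and function names are illustrative, not the paper's exact interfaces.

```python
# Minimal sketch of the LATM maker/user split with a functional cache.
# The lightweight "tool user" only executes cached, validated code.

TOOL_CACHE: dict[str, str] = {}  # class of request -> validated tool source

def call_maker_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a strong LLM client here")

def make_tool(task_class: str, examples: list, unit_tests: list) -> str:
    """Tool maker: programming-by-example with in-context unit-test validation."""
    if task_class in TOOL_CACHE:              # one-time cost per class of request
        return TOOL_CACHE[task_class]
    source = call_maker_llm(f"Write a Python function `solve` for: {examples}")
    ns: dict = {}
    exec(source, ns)                          # wrap generated code into a callable
    for inp, expected in unit_tests:          # validate before caching
        assert ns["solve"](inp) == expected, "generated tool failed a unit test"
    TOOL_CACHE[task_class] = source
    return source

def use_tool(task_class: str, query_args):
    """Tool user: lightweight invocation of a cached, call-ready tool."""
    ns: dict = {}
    exec(TOOL_CACHE[task_class], ns)
    return ns["solve"](query_args)
```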
Modular Hierarchical Frameworks
ATLASS (Haque et al., 13 Mar 2025) and OpenAgent (Lyu et al., 2023) employ multi-phase or hierarchical organizational structures for open-domain, on-demand tool creation. ATLASS features a three-phase pipeline—tool requirement analysis, retrieval/generation (including environment setup and API documentation acquisition), and problem solving—coupled to a persistent tool dataset for efficient reuse and minimization of redundant computation. Tool creation addresses dynamically fetched, poorly documented APIs by integrating web scraping and closed-loop code validation.
OpenAgent (GitAgent) (Lyu et al., 2023) introduces a bi-level experience learning scheme: the agent corrects tool flaws both through its own execution experience and by leveraging collective human experience from GitHub Issues/Pull Requests. Task decomposition and sub-agent delegation further support orchestration of complex tool creation, integration, and maintenance in evolving domains.
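The reuse logic common to these pipelines can be sketched as a retrieve-before-generate step over a persistent tool store; the JSON file layout and `generate_fn` hook below are assumptions for illustration, not ATLASS's actual storage format.

```python
import json
import pathlib

TOOL_DB = pathlib.Path("tool_dataset.json")   # hypothetical persistent tool store

def retrieve_or_generate(requirement: str, generate_fn) -> str:
    """Reuse a stored tool matching the analyzed requirement; otherwise
    generate one (e.g. LLM codegen plus closed-loop validation) and persist it."""
    tools = json.loads(TOOL_DB.read_text()) if TOOL_DB.exists() else {}
    if requirement in tools:                  # skip redundant regeneration
        return tools[requirement]
    source = generate_fn(requirement)         # generation + validation hook
    tools[requirement] = source
    TOOL_DB.write_text(json.dumps(tools, indent=2))
    return source
```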
3. Representation, Retrieval, and Adaptive Invocation
Token-Based Integration and Generative Selection
ToolGen (Wang et al., 4 Oct 2024) advances end-to-end tool learning by embedding each tool as a unique virtual token in the LLM vocabulary. The approach:
- Avoids external retrieval constraints by integrating tool semantics within the model’s parameters (via dedicated loss functions during “tool memorization” and retrieval training).
- Enables the LLM to generate tool calls and arguments as a natural extension of next-token prediction, blending tool selection and invocation within a unified generative process.
- Demonstrates scalability to >47,000 tools and improved pass rates and win rates compared to systems requiring external retrieval and complex prompt engineering.
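Schematically, the token-based integration can be reproduced with standard vocabulary-extension utilities; the base model (`gpt2`), tool names, and prompt below are placeholders, and the tool-memorization and retrieval fine-tuning stages that make the new embeddings meaningful are omitted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One virtual token per tool, appended to the vocabulary.
tool_tokens = [f"<tool_{name}>" for name in ("weather_lookup", "unit_convert")]
tokenizer.add_tokens(tool_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))            # new rows, to be trained

# After fine-tuning (not shown), tool selection becomes plain next-token
# prediction: the model can emit a tool token, then its arguments.
ids = tokenizer("What is the weather in Paris? ", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=8)[0]))
```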
Large Tool Libraries and Recursive Search
Tulip Agent (Ocker et al., 31 Jul 2024) introduces modular agent architectures with CRUD (create/read/update/delete) access to vector-store-backed tool libraries, enabling:
- Task decomposition and recursive semantic search for relevant tools without inflating the prompt context.
- Dynamic creation/adaptation of tools, recursive prompt decomposition, and cost reduction—critical for scaling to hundreds of tools.
- Empirical evidence from ablations showing notable improvements in accuracy and selection precision when combining decomposition and recursive retrieval.
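A condensed sketch of such a library is given below, with a pluggable `embed` function standing in for whichever embedding model backs the vector store; the class layout is an illustration, not the Tulip Agent API.

```python
import numpy as np

class ToolLibrary:
    """Vector-store-backed tool library with CRUD access and semantic search.
    `embed` is a pluggable embedding function (text -> vector)."""

    def __init__(self, embed):
        self.embed = embed
        self.tools: dict[str, str] = {}        # tool name -> description/source
        self.vecs: dict[str, np.ndarray] = {}

    def create(self, name: str, description: str):
        self.tools[name] = description
        v = np.asarray(self.embed(description), dtype=float)
        self.vecs[name] = v / np.linalg.norm(v)

    def delete(self, name: str):
        self.tools.pop(name, None)
        self.vecs.pop(name, None)

    def search(self, query: str, k: int = 3) -> list[str]:
        """Top-k tools by cosine similarity, so the prompt context carries only
        k candidate schemas instead of the whole library."""
        q = np.asarray(self.embed(query), dtype=float)
        q /= np.linalg.norm(q)
        scores = {name: float(v @ q) for name, v in self.vecs.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]
```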
Self-Refinement and Reward-Driven Adaptation
ToolACE-R (Zeng et al., 2 Apr 2025) equips LLM agents with an adaptive self-refinement loop, allowing iterative improvement of tool calls in the absence of human feedback. The process:
- Generates an initial tool invocation and then iteratively refines it, conditioning each round on the previous call and its execution feedback, schematically $c^{(k+1)} = \mathcal{M}\big(q,\, c^{(k)},\, \mathrm{feedback}(c^{(k)})\big)$, until the output stabilizes.

Training is model-aware and iterative, aligning the learning distribution with model capabilities via the pass@k criterion. This reduces compute on trivial queries and focuses refinement on more difficult cases.
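A minimal rendering of this stabilization loop, with hypothetical `propose` and `execute` hooks standing in for the trained model and the tool runtime, might look as follows.

```python
# Hedged sketch of adaptive self-refinement: iterate model proposals, feeding
# back the execution result, and stop once the call reaches a fixed point or
# the round budget is exhausted.
def refine_tool_call(query: str, propose, execute, max_rounds: int = 4):
    call = propose(query, prev_call=None, feedback=None)   # initial invocation
    for _ in range(max_rounds):
        feedback = execute(call)                           # run the tool call
        new_call = propose(query, prev_call=call, feedback=feedback)
        if new_call == call:                               # output stabilized
            break
        call = new_call
    return call
```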
AutoTIR (Wei et al., 29 Jul 2025) applies reinforcement learning to let the agent dynamically decide whether, and which, tools to call. The reasoning process is formalized as a chain of reasoning and tool-invocation actions $(a_1, \dots, a_T)$, with hybrid rewards optimizing both tool selection and output accuracy. This enables generalizable, non-static integration of external tools.
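As an illustration, a hybrid reward in this spirit could combine a tool-selection term with an answer-accuracy term; the weights and exact-match shaping below are assumptions, not the paper's reward.

```python
# Toy hybrid reward: one term scores whether invoking a tool was the right
# decision, the other scores final answer correctness.
def hybrid_reward(used_tool: bool, tool_was_needed: bool,
                  answer: str, reference: str,
                  w_tool: float = 0.3, w_answer: float = 0.7) -> float:
    r_tool = 1.0 if used_tool == tool_was_needed else 0.0   # tool-selection term
    r_answer = 1.0 if answer.strip() == reference.strip() else 0.0
    return w_tool * r_tool + w_answer * r_answer
```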
4. Reference-Guided and Automated Code Extraction Mechanisms
Reference-Based Synthesis
RefTool (Liu et al., 27 May 2025) operationalizes reference-guided tool creation. LLMs extract textbook structure, synthesize code with illustrative demonstrations, and validate outputs. Tools are organized in a hierarchical toolbox mirroring the reference structure, with two-step retrieval (chapter, then tool) for contextual relevance. Experimental findings show an average absolute accuracy boost of 11.3% across causality, physics, and chemistry benchmarks.
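The two-step retrieval can be sketched as nested similarity maximization over a chapter-to-tools mapping; the `similarity` scorer and toolbox layout below are illustrative assumptions.

```python
# Two-step retrieval over a hierarchical toolbox mirroring a textbook:
# pick the best-matching chapter first, then the best tool within it.
def retrieve(query: str, toolbox: dict, similarity):
    """toolbox: chapter title -> {tool name -> tool description}."""
    chapter = max(toolbox, key=lambda c: similarity(query, c))        # step 1
    tools = toolbox[chapter]
    tool = max(tools, key=lambda t: similarity(query, tools[t]))      # step 2
    return chapter, tool
```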
Automatic API Understanding
ToolFactory (Ni et al., 28 Jan 2025) presents a fully automated pipeline for transforming unstructured REST API documentation into AI-compatible tools. Core components:
- An information extraction LLM (APILLAMA, optimized via soft prompt tuning) outputs standardized JSON schemas from arbitrary documentation.
- A code generator translates this into executable Python tools or OpenAPI YAML descriptions.
- Out-of-distribution errors are diagnosed by validation metrics (structural, semantic, and actual API call correctness), while a knowledge base of verified tools supports value inference for incomplete specs.
Performance metrics indicate a 97% valid ratio for JSON extraction and high semantic similarity scores; domain-specific deployments, such as a glycomaterial agent, exhibit scalable integration of heterogeneous APIs.
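The final stage, turning a validated schema into an executable tool, reduces to wrapping an HTTP call; the schema fields below are an assumed layout for illustration, not the exact APILLAMA output format.

```python
import requests

def build_tool(schema: dict):
    """Build a callable Python tool from an extracted API schema.

    Assumed schema example: {"name": "...", "method": "GET",
                             "url": "https://api.example.com/v1/search",
                             "params": ["query", "limit"]}
    """
    def tool(**kwargs):
        params = {k: v for k, v in kwargs.items() if k in schema["params"]}
        resp = requests.request(schema["method"], schema["url"], params=params)
        resp.raise_for_status()          # surface actual-API-call errors
        return resp.json()
    tool.__name__ = schema["name"]
    return tool
```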
5. Embodied and Physical Tool Creation in Robotics
Model-Based Trajectory and Geometry Optimization
Learning generalizable tool-use skills (Qi et al., 2023) and robotic tool design frameworks such as RobotSmith (Lin et al., 17 Jun 2025) and VLMgineer (Gao et al., 16 Jul 2025) extend autonomous creation to the physical domain. Salient points:
- ToolGen (Qi et al., 2023; distinct from the token-level ToolGen above) generates task trajectories in point cloud space, supporting generalization to unseen tools via a two-part generative model, sequential pose optimization, and Chamfer distance–based loss minimization.
- RobotSmith (Lin et al., 17 Jun 2025) combines VLM-driven proposal and criticism loops for modular, parameterized tool designs (CSG-inspired composition), optimizing both tool geometry and trajectory parameters against a simulation-grounded joint objective. It achieves a 50% average task success rate, outperforming Meshy (21.4%) and retrieval baselines (11.1%), with physically realized deployments validating transfer to real robots.
- VLMgineer (Gao et al., 16 Jul 2025) demonstrates co-design of tools and trajectories via joint VLM code+action generation and evolutionary search (fitness-based selection via custom objectives); a toy version of this loop is sketched after the list. Tools are output as URDF assets, with manipulation policies as parametrized motion plans. On RoboToolBench tasks it outperforms human-engineered and expert-prompted baselines.
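The sketch below shows a toy evolutionary loop of the kind VLMgineer's search implies: candidate tools are parameter vectors (e.g. hook length or curvature), fitness comes from a simulator stand-in, and elites are mutated each generation; every detail here is a simplification.

```python
import random

def evolve(fitness, dim: int = 3, pop: int = 16, gens: int = 20, sigma: float = 0.1):
    """Evolutionary search over tool design parameters in [0, 1]^dim."""
    population = [[random.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=fitness, reverse=True)
        elites = scored[: pop // 4]                       # fitness-based selection
        population = elites + [
            [min(1.0, max(0.0, g + random.gauss(0.0, sigma)))
             for g in random.choice(elites)]              # mutate an elite design
            for _ in range(pop - len(elites))
        ]
    return max(population, key=fitness)

# Stand-in fitness: pretend the simulator rewards one particular geometry.
best = evolve(lambda params: -sum((g - 0.7) ** 2 for g in params))
print(best)
```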
6. Advanced Agentic and Closed-Loop Architectures
Autonomous Deep Agent (Yu et al., 10 Feb 2025) illustrates continuous improvement through recursive “planner–executor” cycles within hierarchical task DAGs (HTDAGs), streamlining reusable API/tool creation from UI interaction traces. Components such as the Autonomous API and Tool Creation system and Prompt Tweaking Engine contribute to ongoing self-optimization—prompt refinement and node-level re-planning allow adaptation as tasks, UIs, or user preferences shift. The service infrastructure provides a backbone for context and dependency management in complex, multi-phase workflows.
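A minimal rendering of hierarchical-task-DAG execution with a node-level re-planning hook is sketched below; the node structure and `replan` signature are illustrative assumptions, not the system's actual schema.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TaskNode:
    name: str
    run: Callable[[], None]
    deps: list = field(default_factory=list)      # names of prerequisite nodes

def execute_dag(nodes: dict, replan: Callable) -> None:
    """Run every node whose dependencies are complete; on failure, hand the
    node to a planner-side `replan` hook before continuing."""
    done: set = set()
    while len(done) < len(nodes):
        ready = [n for n in nodes.values()
                 if n.name not in done and all(d in done for d in n.deps)]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency in the DAG")
        for node in ready:
            try:
                node.run()
            except Exception as err:
                replan(node, err)                 # node-level re-planning
            done.add(node.name)
```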
ToolMaker (“LLM Agents Making Agent Tools” (Wölflein et al., 17 Feb 2025)) demonstrates an autonomous pipeline for converting code repositories (typically from scientific papers) into LLM-compatible tools, using a Docker-based environment setup and a closed-loop self-correction mechanism driven by comprehensive unit test suites. ToolMaker achieves 80% correctness across challenging computational tasks, far surpassing engineering agent baselines.
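The closed-loop correction cycle can be sketched as a test-run-repair loop; `repair` stands in for the LLM-backed diagnosis step, and the pytest invocation is one plausible way to drive a unit-test suite, not necessarily ToolMaker's exact harness.

```python
import subprocess

def install_loop(repo_dir: str, repair, max_attempts: int = 5) -> bool:
    """Run the repository's test suite; feed failures back to a repair step
    until the suite passes or the attempt budget runs out."""
    for _ in range(max_attempts):
        result = subprocess.run(
            ["pytest", "-x", "--tb=short"], cwd=repo_dir,
            capture_output=True, text=True,
        )
        if result.returncode == 0:        # full unit-test suite passed
            return True
        repair(repo_dir, result.stdout + result.stderr)   # self-correction step
    return False
```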
7. Implications, Challenges, and Future Directions
Contemporary approaches demonstrate consistent gains over static or hand-crafted tool strategies in both software and robotics domains, supported by rigorous evaluation on standardized and domain-specific benchmarks. Key implications include:
- Reduction of engineering overhead and operational cost through persistent tool caches, automated environment orchestration, and iterative refinement.
- Enhanced generalization: frameworks like ToolGen and ATLASS demonstrate robust tool creation in the face of limited data, under-specified documentation, or novel task demands.
- Future research challenges are highlighted: autonomous structure learning of affordances (Collis et al., 2023), scaling to multi-modal and hierarchical tools, improving safety and security in autonomous code execution (Haque et al., 13 Mar 2025), and bridging from simulation to reliable physical-world deployment (Lin et al., 17 Jun 2025).
The field is converging on unified frameworks where agents dynamically create, adapt, and deploy both software and hardware tools, leveraging foundational models, structured external references, reinforcement learning, and iterative self-improvement. Autonomous tool creation thus stands as one of the core components for next-generation autonomous agents and embodied AI systems.