CREATOR Algorithm: Dynamic LLM Tool Creation
- CREATOR is a framework for LLMs that dynamically invents bespoke tools by disentangling abstract tool creation from concrete decision execution, enhancing generalization and interpretability.
- The system comprises modular stages—including Documentation Reader, Code Realizer, and Rectifier—that enable automated code synthesis, sandbox execution, and systematic error correction.
- Empirical evaluations on math and tabular benchmarks show CREATOR achieving up to a 13-point accuracy improvement over traditional chain-of-thought and API tool-use baselines.
CREATOR is a framework for LLMs that extends their capabilities beyond statically invoking external APIs, enabling them to invent, document, realize, and debug bespoke tools to robustly solve complex reasoning tasks. The core innovation of CREATOR is the explicit disentanglement of abstract tool creation ("what tool do I need?") from concrete decision execution ("how do I use the tool for this query?"), yielding improved generalization, interpretability, and reliability on math and tabular reasoning benchmarks. The CREATOR system defines a multi-stage architecture featuring document-driven abstraction, automated code synthesis, registry-based tool storage, decision program generation, sandbox execution, and automatic rectification via LLM re-prompting on errors. Empirical evaluation demonstrates consistent and significant gains over chain-of-thought (CoT), program-of-thought (PoT), and vanilla API tool-use baselines, particularly in scenarios lacking pre-existing APIs, as exemplified by the newly introduced Creation Challenge dataset (Qian et al., 2023).
1. Motivation and Conceptual Foundation
Traditional LLM tool-use pipelines are bounded by three primary constraints: limited scope due to a static small API set, brittle reasoning when planning and execution are entangled in a single inference chain, and fragile execution due to inadequate error handling. While LLMs enhanced with tool-calling interfaces (e.g., calculator, web search) can address well-defined problems, they are fundamentally restricted when novel or domain-specific utility abstractions are required, or when multi-stage reasoning with intermediate state is crucial.
CREATOR addresses these problems by empowering the LLM to author domain-appropriate tools on demand. This paradigm shift expands the scope of solvable problems, separates high-level strategy (tool creation) from low-level tactics (decision execution), and integrates error diagnosis and correction. An explicit four-stage pipeline—Creation, Decision, Execution, and Rectification—decomposes the end-to-end process: the first two stages handle reasoning about concepts and actions, while the final two are implemented via code interpretation and error-driven rectification.
2. Formalization: Disentangling Abstract Creation and Concrete Execution
CREATOR's methodology is formally structured around two LLM-driven stages:
- Abstract Tool Creation: Given a natural language query Q and a set of few-shot creation demonstrations, the LLM produces a ToolSpec structure (signature, docstring, and pseudocode). This generic abstraction is then realized as executable code (ToolCode). Matching the notation of the pseudocode in Section 4, the process can be written as ToolSpecs = LLM_Create_Specs(Q, demos_creation), followed by ToolCode = LLM_Realize_Code(spec) for each spec.
Example: For quadratic equation solving, the ToolSpec may define `def solve_quadratic(a: float, b: float, c: float) -> List[float]:` with a docstring documenting the mathematical formula and expected behavior.
- Concrete Decision Execution: With access to the created tool(s), the LLM generates a decision program P that orchestrates parsing, tool invocation, postprocessing, and answer synthesis. In the notation of Section 4: decision_prog = LLM_Decide(Q, ToolCodes, demos_decision),
where ToolCodes denotes the registry of available tool implementations. The generated decision program is a valid Python script that imports tool implementations, extracts arguments, selects the appropriate tool, passes arguments, and formats the answer.
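For concreteness, a plausible ToolCode realization of the quadratic-solver ToolSpec above, paired with a minimal decision program, might look like the following. The implementation details (root deduplication, ascending sort, empty list for no real roots) are illustrative choices, not taken from the paper:

```python
import math
from typing import List

def solve_quadratic(a: float, b: float, c: float) -> List[float]:
    """Return the real roots of a*x^2 + b*x + c = 0, sorted ascending.

    Returns an empty list when the discriminant is negative
    (no real roots), and a single root when the discriminant is zero.
    """
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    sqrt_disc = math.sqrt(disc)
    # A set collapses the double root when disc == 0.
    roots = {(-b - sqrt_disc) / (2 * a), (-b + sqrt_disc) / (2 * a)}
    return sorted(roots)

# A decision program would then extract the coefficients from the
# query, invoke the tool, and format the final answer:
roots = solve_quadratic(1.0, -3.0, 2.0)  # roots == [1.0, 2.0]
answer = f"The solutions are x = {roots}"
```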
By partitioning reasoning across these two stages, CREATOR isolates conceptual abstraction from executional specification, thereby enhancing modularity and recovery from error.
3. System Architecture and Data Flow
CREATOR is composed of six core modules, each responsible for an explicit stage in the data and control flow:
| Module | Input(s) | Output(s) |
|---|---|---|
| Documentation Reader | Problem Q, demonstration specs | ToolSpec (signature + docstring) |
| Code Realizer | ToolSpec | ToolCode (Python function) |
| Tool Registry | ToolCode(s) | Stored tools |
| Decision Maker | Q, Tool Registry | Decision program (Python code) |
| Execution Engine & Error Monitor | ToolCode(s), decision program | Execution result (answer or error traceback) |
| Rectifier | Q, ToolCode(s), decision program, error traceback | Corrected decision program |
Control proceeds from parsing the problem and demonstrations, through abstraction and code realization, to execution in a sandboxed environment. On error, the system re-prompts the LLM with the code and traceback, iterating up to a maximum number of rectification attempts.
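The execution-and-monitoring step can be sketched as a small helper that runs the tool definitions and the decision program in a fresh namespace, returning either captured output or a traceback. This is a minimal illustration; a production sandbox would isolate execution in a separate process with resource limits:

```python
import io
import traceback
from contextlib import redirect_stdout

def execute_code(tool_code: str, decision_prog: str):
    """Run tool definitions plus a decision program in a fresh namespace.

    Returns (captured_stdout, error_traceback); exactly one is non-None,
    matching the (output, error) pair used in the CREATOR pseudocode.
    """
    namespace = {}
    buf = io.StringIO()
    try:
        with redirect_stdout(buf):
            exec(tool_code, namespace)      # define the created tools
            exec(decision_prog, namespace)  # run the decision program
        return buf.getvalue(), None
    except Exception:
        return None, traceback.format_exc()
```

On success the captured stdout is parsed for the answer; on failure the traceback is fed back to the Rectifier stage.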
4. Algorithmic Procedure
The pseudocode specifying the core CREATOR workflow is as follows:
```
function CREATOR(Q, demos_creation, demos_decision, max_rectify=3):
    # Stage 1: Abstract Tool Creation
    ToolSpecs = LLM_Create_Specs(Q, demos_creation)
    ToolCodes = [LLM_Realize_Code(spec) for spec in ToolSpecs]
    register_tools(ToolCodes)

    # Stage 2: Concrete Decision Execution
    decision_prog = LLM_Decide(Q, ToolCodes, demos_decision)

    # Stages 3 & 4: Execution + Rectification
    for attempt in range(max_rectify + 1):
        output, error = execute_code(ToolCodes, decision_prog)
        if error is None:
            return parse_answer(output)
        # Rectification: re-prompt the LLM with the failing code and traceback
        decision_prog = LLM_Rectify(Q, ToolCodes, decision_prog, error)

    raise RuntimeError("Failed after rectification")
```
Crucially, rectification is built in as a feedback loop involving LLM re-prompting with both the code and its error trace, supporting automatic patching of decision programs.
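The rectification re-prompt itself can be sketched as a simple template that presents the LLM with its own failing code and traceback. The template below is hypothetical (the paper's exact prompt wording is not reproduced here):

```python
def build_rectify_prompt(query: str, tool_code: str,
                         decision_prog: str, error: str) -> str:
    """Assemble a rectification prompt from the failing code and its traceback.

    An illustrative template, not the paper's actual prompt text.
    """
    return (
        f"Question:\n{query}\n\n"
        f"Tool implementation:\n{tool_code}\n\n"
        f"Decision program (failed):\n{decision_prog}\n\n"
        f"Error traceback:\n{error}\n\n"
        "Rewrite the decision program so that it runs without error "
        "and answers the question."
    )
```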
5. Benchmarks, Metrics, and Empirical Results
Empirical evaluation was conducted using ChatGPT (gpt-3.5-turbo, 512 max tokens) on the following benchmarks:
- MATH: 7-domain competition-grade mathematics problems.
- TabMWP: Table-based word problems across grade levels.
- Creation Challenge: A newly introduced dataset with 2,000 diverse, tool-creation-required queries.
Baselines included plain LLM response ("Vanilla"), Chain-of-Thought (CoT), Program-of-Thought (PoT), Tool Use (e.g., WolframAlpha API), and CREATOR-Entangled (creation+decision fused). Core metrics were exact-match numerical accuracy and rate of successful code execution.
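The exact-match accuracy metric can be sketched as follows; the numeric-tolerance and string-fallback handling here are assumptions about how such a metric is typically implemented, not details from the paper:

```python
def exact_match_accuracy(predictions, golds, tol=1e-6):
    """Fraction of predictions matching gold answers.

    Numeric answers are compared within a small tolerance;
    non-numeric answers fall back to stripped string equality.
    """
    def matches(pred, gold):
        try:
            return abs(float(pred) - float(gold)) <= tol
        except (TypeError, ValueError):
            return str(pred).strip() == str(gold).strip()

    if not golds:
        return 0.0
    correct = sum(matches(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)
```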
| Benchmark | Best Baseline (%) | CREATOR (%) | Improvement (abs. pts) |
|---|---|---|---|
| MATH | 46.5 | 59.7 | +13.2 |
| TabMWP | 87.3 | 94.7 | +7.4 |
| Creation Challenge* | N/A | 58.2–75.7 | — |
*Standard CREATOR with no hints scored 58.2%; utility-level hints and full I/O hints raise accuracy to 67.2% and 75.7%, respectively.
Ablation studies demonstrated that explicitly disentangling the abstraction and execution stages provided a 5–7% accuracy increase relative to "entangled" prompting, and that rectification contributed approximately 10% absolute gain on challenging math and table tasks. CREATOR's performance was robust across problem difficulty levels, whereas baseline accuracy degraded as difficulty increased.
6. The Creation Challenge Dataset: Construction and Significance
The Creation Challenge dataset comprises 2,000 problems intentionally crafted such that no standard API or library directly applies. Each instance includes a natural language description, a canonical tool specification (utility description, I/O signature, reference implementation), sample decision code, and a gold answer key. Dataset construction involved manual seeding followed by tenfold expansion with LLM outputs (text-davinci-003) filtered for novelty and difficulty. Problem domains cover polynomial fitting, combinatorics, geometry, and data analysis, explicitly rewarding the ability to invent new utilities. Evaluation involves zero-shot CREATOR with variable hinting to probe tool-creation capability.
7. Knowledge Transfer, Emergent Tool-Creation, and Future Directions
Analysis of transfer and abstraction uses 300 queries grouped into 100 clusters with shared core concepts. When a correct tool from one scenario is transferred to siblings within its cluster, accuracy improves from 63.0% to 78.3% (a 15.3-point gain), illustrating CREATOR's facilitation of abstracted cross-task generalization.
Tool creation abilities exhibit distinct stratification: (1) enhancement/wrapping of an existing API for new purposes, (2) pipeline concatenation of multiple APIs into a single utility, and (3) hierarchical construction where master tools invoke subtools.
Potential future extensions include scaling to open-ended tool repositories (discovery and publication), generalization to multimodal and complex software engineering tasks, and the development of automated metrics for LLM-generated code quality. CREATOR represents a step toward autonomous generative agents capable of identifying not just "which API to call," but "which API to author" for novel problem domains (Qian et al., 2023).