ToolACE: Scalable LLM Tool Framework

Updated 18 February 2026
  • ToolACE is a family of frameworks and toolchains that equip LLMs with scalable, data-centric tool-use and robust API interactions.
  • It employs methodologies including self-evolutionary API synthesis, adaptive self-refinement, and non-autoregressive multi-turn dialogue generation.
  • Empirical results show ToolACE variants achieve state-of-the-art performance on function-calling leaderboards and agent web routing benchmarks.

ToolACE refers to a family of frameworks and data-centric toolchains for equipping LLMs with scalable, robust capabilities for tool-use, programmatic API interaction, multi-turn agentic workflows, and rigorous technical evaluation. ToolACE encompasses several methodological advances—including self-evolutionary API pool synthesis, agent-based and non-autoregressive dialogue generation, self-improving fine-tuning, and the development of benchmark datasets and routing agents—that are foundational for modern tool-augmented LLM agents. ToolACE and its variants (including ToolACE-MT, ToolACE-R, ToolACE-DEV, and ToolACE-MCP) have established state-of-the-art performance on function-calling leaderboards and agent web routing benchmarks. The following sections provide an in-depth analysis of the architectures, methodologies, data generation and curation pipelines, empirical results, comparative advantages, and ongoing research challenges in the ToolACE ecosystem.

1. Agentic Data Generation and Self-Evolution Mechanisms

The original ToolACE framework (Liu et al., 2024) introduces the Tool Self-Evolution Synthesis (TSS) process, comprising three recursive stages: Speciation (extraction of high-level domains and functionalities from technical corpora to form a hierarchical API context tree), Adaptation (assignment of complexity scores and structured parameterization), and Evolution (LLM-driven API synthesis with diversity operators to maximize compositional and type diversity). This yields a comprehensive API pool (26,507 APIs over 30 domains, 390 subdomains, 3,398 fine-grained classes), balanced by type and structural complexity.
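
The three TSS stages can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the `context_tree`, the depth-based complexity score, and the `opt_*` parameter-mutation operator are all invented stand-ins for the corpus-derived tree, learned complexity scoring, and LLM-driven diversity operators described above.

```python
import random

# Speciation: organize candidate functionalities into a hierarchical
# domain -> subdomain -> API-class context tree (toy example).
context_tree = {
    "travel": {"booking": ["flight_search", "hotel_search"]},
    "finance": {"markets": ["stock_quote"]},
}

def adapt(api_name, depth):
    """Adaptation: assign a complexity score and a structured parameter schema."""
    return {
        "name": api_name,
        "complexity": depth,  # stand-in: deeper tree nodes score as more complex
        "parameters": {"query": {"type": "string", "required": True}},
    }

def evolve(api, rng):
    """Evolution: apply a diversity operator, here adding one typed optional
    parameter to broaden type coverage of the synthesized pool."""
    new_type = rng.choice(["integer", "boolean", "array", "object"])
    evolved = dict(api)
    evolved["parameters"] = dict(api["parameters"])
    evolved["parameters"][f"opt_{new_type}"] = {"type": new_type, "required": False}
    return evolved

rng = random.Random(0)
pool = []
for domain, subdomains in context_tree.items():
    for subdomain, classes in subdomains.items():
        for api_name in classes:
            pool.append(evolve(adapt(api_name, depth=3), rng))
```

Recursing the `evolve` step over the pool, rather than applying it once, is what makes the real synthesis "self-evolutionary": each generation of APIs seeds the next.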

Dialog generation leverages a multi-agent simulation. Three agents—User, Assistant, Tool—interact to produce multi-turn, programmatic, and naturalistic function-calling dialogues. The Assistant’s chain-of-thought unfolds as query interpretation, tool necessity decision, and parameter completeness checking. Explicit self-consistency and formalized thinking procedures enforce both decision diversity and agent robustness. Output samples are stratified for complexity and semantic diversity.
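
A minimal sketch of the three-agent loop follows. The trigger-keyword tool decision and substring parameter check are hypothetical stand-ins for the Assistant's LLM-driven chain-of-thought (query interpretation, tool-necessity decision, parameter completeness checking); the agent and tool names are invented.

```python
def user_agent(goal):
    return {"role": "user", "content": goal}

def assistant_agent(message, tools):
    # Chain-of-thought stages, each replaced here by a toy heuristic:
    # 1. query interpretation + tool-necessity decision
    needs_tool = any(t["trigger"] in message["content"] for t in tools)
    if not needs_tool:
        return {"role": "assistant", "content": "No tool needed."}
    tool = next(t for t in tools if t["trigger"] in message["content"])
    # 2. parameter completeness check: ask a follow-up if anything is missing
    missing = [p for p in tool["required"] if p not in message["content"]]
    if missing:
        return {"role": "assistant", "content": f"Please provide: {missing}"}
    # 3. emit the function call
    return {"role": "assistant", "tool_call": tool["name"]}

def tool_agent(call):
    # Simulated tool execution result
    return {"role": "tool", "content": f"{call['tool_call']} -> ok"}

tools = [{"name": "get_weather", "trigger": "weather", "required": ["city"]}]
dialogue = [user_agent("weather in city Paris")]
reply = assistant_agent(dialogue[-1], tools)
dialogue.append(reply)
if "tool_call" in reply:
    dialogue.append(tool_agent(reply))
```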

The data validation pipeline is dual-layered: a rule-based validator (for JSON-schema compliance, parameter typing, consistency, and structure) is augmented by multiple LLM-based discriminators targeting hallucination detection, intent satisfaction, and simulated tool output consistency. Human spot checks may be applied to high-uncertainty or low-confidence samples. Empirically, this strategy yields function-calling datasets with high syntactic and semantic fidelity—models fine-tuned on ToolACE data achieve 91.41% overall on the Berkeley Function-Calling Leaderboard (BFCL), surpassing all contemporaneous closed-source baselines (Liu et al., 2024).
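
The rule-based first layer can be sketched as a schema-compliance check over candidate function calls; the schema shape below is a generic JSON-schema-like convention assumed for illustration, and the LLM-based discriminator layer that follows it is not shown.

```python
import json

def rule_validate(call_json, schema):
    """Rule-based validation layer: JSON well-formedness, function-name
    consistency, required-parameter presence, and parameter typing."""
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError:
        return False
    if call.get("name") != schema["name"]:
        return False
    type_map = {"string": str, "integer": int, "boolean": bool}
    args = call.get("arguments", {})
    for pname, spec in schema["parameters"].items():
        if spec.get("required") and pname not in args:
            return False
        if pname in args and not isinstance(args[pname], type_map[spec["type"]]):
            return False
    return True

schema = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": "string", "required": True},
        "days": {"type": "integer", "required": False},
    },
}
good = rule_validate('{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}', schema)
bad = rule_validate('{"name": "get_weather", "arguments": {"days": "three"}}', schema)
```

Only samples passing this cheap deterministic layer proceed to the more expensive LLM-based discriminators for hallucination and intent checks.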

2. Iterative Self-Improvement and Adaptive Refinement

ToolACE-R (Zeng et al., 2 Apr 2025) extends the core pipeline with adaptive self-refinement for tool learning. Rather than one-shot fine-tuning, ToolACE-R implements a model-aware, iterative curriculum that progressively incorporates harder data as the model improves. This is operationalized via pass@k data selection (retaining only samples that the base model can solve within k attempts), self-refinement data augmentation (iteratively generating and re-refining tool-call traces), and an adaptive stopping criterion during inference (halting when the model's output stabilizes).
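
The pass@k selection step can be sketched as a filter over candidate training samples. The difficulty field and the `solve` predicate below are toy stand-ins for actually sampling k generations from the base model and checking them against ground truth.

```python
def pass_at_k_filter(samples, model_solve, k):
    """Retain only samples the current model can solve within k attempts,
    so the curriculum's difficulty tracks the model's ability."""
    kept = []
    for sample in samples:
        if any(model_solve(sample, attempt) for attempt in range(k)):
            kept.append(sample)
    return kept

# Toy stand-in: a sample counts as "solved" on attempt i if its difficulty <= i.
samples = [
    {"id": 1, "difficulty": 0},
    {"id": 2, "difficulty": 2},
    {"id": 3, "difficulty": 9},  # too hard for this round; deferred
]
solve = lambda s, attempt: s["difficulty"] <= attempt
kept = pass_at_k_filter(samples, solve, k=3)
```

As the model improves between rounds, previously filtered-out hard samples re-enter the pool, which is what makes the curriculum progressive.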

The adaptive process is both compute-efficient and model-aligned: empirical results demonstrate that adaptive self-refinement achieves 86.49% on BFCL, outperforming both prior open and closed models in function-calling accuracy. This approach also generalizes across architectures (LLaMA, Qwen, DeepSeek) and scales with model size, with empirical gains of 4–8 points over baselines. The self-refinement technique is crucial for closing the performance gap to cutting-edge proprietary models while maintaining cost-effectiveness (Zeng et al., 2 Apr 2025).

3. Non-Autoregressive and Modular Multi-Turn Dialogue Generation

ToolACE-MT (Zeng et al., 18 Aug 2025) advances agentic data construction by introducing a non-autoregressive, mask-and-fill iterative generation framework. This departs from the standard multi-agent autoregressive simulation—where each turn is strictly conditioned on all history and requires O(N²) LLM invocations for N-turn dialogues—by first generating a coarse dialogue skeleton via high-level plan sampling, then iteratively refining masked dialogue spans through LLM-based complexity injection and reasonability refinement passes.
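
The skeleton-then-fill idea can be sketched as follows; the alternating-role skeleton and the single-pass `filler` are simplifications of the paper's plan sampling and multi-pass LLM refinement, and the `<MASK>` sentinel is an assumed convention.

```python
def generate_skeleton(n_turns):
    """Coarse plan: a fixed turn structure with every span initially masked."""
    roles = ["user" if i % 2 == 0 else "assistant" for i in range(n_turns)]
    return [{"role": r, "content": "<MASK>"} for r in roles]

def fill_pass(skeleton, filler):
    """One refinement pass over all masked spans. Repeated passes would
    apply complexity injection and reasonability refinement; crucially,
    one pass costs O(N) generator calls, not the O(N^2) history-conditioned
    calls of turn-by-turn autoregressive simulation."""
    calls = 0
    for turn in skeleton:
        if turn["content"] == "<MASK>":
            turn["content"] = filler(turn["role"])
            calls += 1
    return calls

skeleton = generate_skeleton(4)
calls = fill_pass(skeleton, lambda role: f"[{role} text]")
```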

Acceptance is controlled through a combination of rule-based and learned model-based offline verification. The model judger is trained on labeled dialogue accept/reject pairs, optimizing for both local (span-level) and global (dialogue coherence) correctness. ToolACE-MT demonstrates a 30% LLM call reduction and +9 point multi-turn accuracy gain over autoregressive simulation baselines (BFCL multi-turn: 40.25% vs. 31.38%), establishing a new paradigm for agentic data creation at scale (Zeng et al., 18 Aug 2025).

4. Self-Improving Tool Learning and Autonomy

ToolACE-DEV (Huang et al., 12 May 2025) addresses tool learning via full self-improvement: after initial fine-tuning on raw tool documentation and a moderate-sized GPT-4-synthesized gen+inv dataset, the model is allowed to bootstrap entirely new training data from unlabeled user queries via generative self-evolution. The process decomposes tool learning into (a) tool documentation adaptation, (b) query-aware tool generation, and (c) tool invocation, jointly optimized via multi-task objectives.
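
The three-way decomposition can be sketched as a builder that turns one (documentation, query) pair into three supervised samples, one per subtask. The field names and the hardcoded invocation target are illustrative assumptions, not the paper's data format.

```python
def build_multitask_samples(doc, query):
    """Decompose tool learning into three jointly optimized subtasks:
    (a) documentation adaptation, (b) query-aware tool generation,
    (c) tool invocation."""
    tool = doc["normalized"]
    return [
        # (a) learn to normalize raw docs into a structured schema
        {"task": "doc_adaptation", "input": doc["raw"], "target": tool},
        # (b) learn to generate/select the right tool for a query
        {"task": "tool_generation", "input": query, "target": tool["name"]},
        # (c) learn to emit the grounded call (arguments hardcoded here)
        {"task": "tool_invocation",
         "input": (query, tool),
         "target": {"name": tool["name"], "arguments": {"city": "Paris"}}},
    ]

doc = {
    "raw": "get_weather(city) - weather lookup",
    "normalized": {"name": "get_weather", "parameters": ["city"]},
}
samples = build_multitask_samples(doc, "What's the weather in Paris?")
```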

The core self-evolving loop relies on the model's own outputs, filtered for format and semantic correctness via rule checkers and majority-vote self-consistency, iteratively increasing data diversity and matching training supervision to the model’s evolving strengths. This substantially reduces dependence on expensive external models (post-init data is fully model-generated), mitigates data-style mismatch, and preserves privacy; empirical results on BFCL and API-Bank confirm that adaptive self-evolution matches or exceeds data-synthesis-based approaches, and generalizes across model families (Huang et al., 12 May 2025).
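
The majority-vote self-consistency filter can be sketched directly; exact string matching of sampled calls is an assumption (a real pipeline would canonicalize calls after the rule-based format check).

```python
from collections import Counter

def self_consistency_filter(candidates, min_votes):
    """Keep a model-generated tool call only if enough independently
    sampled generations agree on it; otherwise discard the query."""
    call, votes = Counter(candidates).most_common(1)[0]
    return call if votes >= min_votes else None

# Five sampled generations for the same unlabeled query; three agree.
candidates = [
    'get_weather(city="Paris")',
    'get_weather(city="Paris")',
    'get_weather(city="paris")',   # casing mismatch: counted separately
    'get_weather(city="Paris")',
    'get_time(zone="CET")',
]
kept = self_consistency_filter(candidates, min_votes=3)
```

Surviving calls become the next round's training data, which is how supervision stays matched to the model's evolving strengths without external annotation.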

5. Routing and Orchestration in Large-Scale Agent Ecosystems

As open agent ecosystems (Agent Web, MCP protocol) scale tool and agent inventories to the thousands, ToolACE-MCP (Yao et al., 13 Jan 2026) provides a scalable history-aware router. This router is trained by (1) building a dependency-rich graph of semantically-functionally related candidates, (2) synthesizing multi-turn dialogue trajectories via LLM-driven random walk and task planning, and (3) extracting supervision at every decision point for routing model fine-tuning.
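
Step (2) can be sketched as a seeded random walk over the dependency graph, where each traversed edge becomes a supervision pair for the router. The graph edges and tool names below are invented; the real pipeline additionally interleaves LLM-driven task planning with the walk.

```python
import random

def random_walk(graph, start, length, rng):
    """Sample a tool-call trajectory by walking the dependency graph;
    each step is a decision point that yields routing supervision."""
    path = [start]
    for _ in range(length):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        path.append(rng.choice(neighbors))
    return path

# Hypothetical dependency-rich graph: edges connect semantically and
# functionally related tools (e.g. a search result feeds a booking call).
graph = {
    "flight_search": ["seat_map", "flight_book"],
    "seat_map": ["flight_book"],
    "flight_book": ["payment"],
}
rng = random.Random(42)
trajectory = random_walk(graph, "flight_search", length=3, rng=rng)
# Supervision: (state, next-tool) pairs extracted at every decision point.
supervision = [(trajectory[i], trajectory[i + 1]) for i in range(len(trajectory) - 1)]
```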

The routing policy π_θ maps (history, query, candidate pool) tuples to a categorical distribution over possible actions, supporting robust plug-and-play integration into heterogeneous execution backends. Empirically, ToolACE-MCP outperforms all embedding-based and LLM-based alternatives—including GPT-4o—on MCP-Universe and MCP-Mark benchmarks, and exhibits strong generalization to agent routing and scalability in the presence of noise or adversarial distractors (Yao et al., 13 Jan 2026).
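
The policy's interface can be sketched as scoring each candidate and softmax-normalizing into a categorical distribution; the lexical-overlap `score` is a toy stand-in for the fine-tuned routing model's learned scores.

```python
import math

def route(history, query, candidates, score):
    """Sketch of a policy pi_theta: map (history, query, candidate pool)
    to a categorical distribution over candidate actions via softmax."""
    logits = [score(history, query, c) for c in candidates]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return {c: e / z for c, e in zip(candidates, exps)}

# Toy scorer: count word overlap between the query and the tool name.
score = lambda h, q, c: float(len(set(q.split()) & set(c.split("_"))))
dist = route(history=[], query="book a flight",
             candidates=["flight_book", "weather_get"], score=score)
best = max(dist, key=dist.get)
```

Returning a full distribution (rather than an argmax) is what allows the router to plug into different backends: an executor can sample, take the mode, or shortlist the top-k actions.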

6. Tool Enrichment, Dynamic Retrieval, and Agentic Integration

The Automated Creation and Enrichment framework (ToolACE, as per (Agarwal et al., 15 Sep 2025)) addresses the critical “last-mile” challenges of tool-for-agent readiness: lack of enterprise API documentation, ambiguous schemas, and large operation catalogs. ToolACE fully automates the parsing of OpenAPI flows, LLM-based enrichment and completion of tool schemas (description, parameter doc, usage examples), and implements embedding-based dynamic shortlisting at runtime to cap prompt size at fixed top-k tool sets.
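
The runtime shortlisting step can be sketched as cosine-similarity ranking over precomputed tool embeddings. The 3-d vectors and tool names below are toy assumptions; a deployment would embed the enriched tool descriptions with a real embedding model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def shortlist(query_vec, tool_vecs, k):
    """Dynamic shortlisting: rank tools by embedding similarity to the
    query and keep only the top-k, capping prompt size regardless of
    how large the operation catalog grows."""
    ranked = sorted(tool_vecs, key=lambda t: cosine(query_vec, t["vec"]), reverse=True)
    return [t["name"] for t in ranked[:k]]

tools = [
    {"name": "create_pod", "vec": [0.9, 0.1, 0.0]},
    {"name": "list_nodes", "vec": [0.2, 0.9, 0.1]},
    {"name": "get_stock",  "vec": [0.0, 0.1, 0.9]},
]
query_vec = [0.8, 0.3, 0.0]  # e.g. an embedding of "launch a pod"
top = shortlist(query_vec, tools, k=2)
```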

These methods yield robust selection and input accuracy improvements; for instance, “Enrich-3” reduces input formation errors on Kubernetes APIs from 11.4% to 3.5%. Deployment in production agentic flows confirms a 27% uplift in aggregate selection and calling accuracy. The framework is model-agnostic and compatible with agentic architectures such as ReAct, further demonstrating its generality and practical impact (Agarwal et al., 15 Sep 2025).

7. Limitations, Evaluation, and Prospective Directions

Several cross-cutting challenges are identified in the ToolACE ecosystem. Synthetic data may not perfectly model every aspect of real-world APIs (e.g., external dependencies, rare error signatures). Model-based verification still inherits the hallucination and false-positive risks present in underlying LLMs. Dialogue complexity remains bounded: call chains of length five or more remain underrepresented. Tool retrieval from truly massive candidate pools is still an open research area; only selection from pre-shortlisted sets is addressed in present implementations. Scaling to larger models (14B+) and online, live feedback is constrained by compute and annotation costs (Liu et al., 2024, Huang et al., 12 May 2025).

Ongoing work targets several extensions: multi-file and architectural-level program refactoring; integration of formal methods for even stricter semantic guarantees; multi-agent, adversarial co-evolution; and direct learning from API logs and live user traces. The recent emergence of non-autoregressive and router-centric agent architectures points toward an evolving toolkit for generalizable, robust tool-orchestrated LLM systems.
