InfTool: Autonomous Closed-Loop Tool-Use Framework

Updated 4 July 2026

InfTool is an autonomous framework that converts raw API specifications into structured, schema-based tool definitions and verified tool-use trajectories.
It employs a multi-agent role-playing system to simulate user interactions and optimize model performance via supervised fine-tuning and Group Relative Policy Optimization.
The closed-loop training mechanism continuously diagnoses failure modes and regenerates synthetic data, substantially improving function-calling accuracy.

InfTool is a fully autonomous framework for synthesizing tool-use data and training LLMs for reliable function-calling under an MCP-style setting. Starting only from raw API specifications, it converts APIs into schema-based tool definitions, generates verified tool-use trajectories through multi-agent role-playing, optimizes the model with supervised fine-tuning and Group Relative Policy Optimization, diagnoses failure modes, and regenerates new data targeted at those weaknesses in a closed loop. In the reported evaluation, this procedure improves a base 32B model from 19.8 to 70.9 total score on the Berkeley Function Calling Leaderboard, entirely from synthetic data and without human annotation (Li et al., 29 Dec 2025).

1. Conceptual model and task formulation

InfTool addresses a specific bottleneck in LLM agents: reliable tool use when the model must read tool specifications, decide whether tool invocation is needed, select the correct function, produce schema-valid arguments, and sometimes execute multi-step workflows over several turns. The framework is presented as a response to three limitations in prior tool-use training regimes: expensive human annotation for trajectories, weak generalization to new tools under static human-curated datasets, and a quality ceiling in one-shot self-synthesis where a single generator reproduces its own biases and coverage gaps (Li et al., 29 Dec 2025).

Its operational premise is that strong tool-use behavior can be learned from synthetic trajectories alone if data generation, verification, weakness analysis, and policy improvement are coupled into a self-evolving loop. In this formulation, tool use is treated as a policy

$\pi : (q, C) \mapsto \mathcal{A},$

where $q$ is the user query, $C$ is context, and $\mathcal{A}$ is the structured action space. Within the MCP setting, a tool call is represented as

$t = (f, \mathbf{a}),$

where $f$ is the function identifier and $\mathbf{a}$ is an argument vector constrained by a strict JSON schema $\mathcal{S}_f$ (Li et al., 29 Dec 2025).
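As a minimal sketch of the representation $t = (f, \mathbf{a})$, the following Python checks an argument dictionary against a simplified schema. The `get_weather` tool, its schema, and the `validate_call` helper are hypothetical illustrations, not part of InfTool; a real MCP server would rely on a full JSON Schema validator rather than this small subset.

```python
from typing import Any

# Hypothetical tool registry: function name -> simplified JSON-schema-like spec.
TOOL_SCHEMAS: dict[str, dict[str, Any]] = {
    "get_weather": {
        "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
        "required": ["city"],
    }
}

# Map JSON schema type names to Python types (only a small subset is covered).
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_call(f: str, a: dict[str, Any]) -> list[str]:
    """Return a list of schema violations for the tool call t = (f, a)."""
    schema = TOOL_SCHEMAS.get(f)
    if schema is None:
        return [f"unknown function: {f}"]
    errors = [
        f"missing required argument: {name}"
        for name in schema["required"]
        if name not in a
    ]
    for name, value in a.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
            continue
        expected = _JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        bool_mismatch = isinstance(value, bool) and spec["type"] != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return errors
```

An empty list means the call is schema-valid under this simplified check; any other result lists the violations a server could return as an observation.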

The phrase “closing the loop” names the framework’s central mechanism. Synthesized data is used to train the policy; the improved policy is then rolled out; failure cases define the current capability frontier; new synthetic data is generated specifically for those failures; and the cycle repeats. This makes data generation an online component of optimization rather than a fixed preprocessing stage. A plausible implication is that InfTool is better understood as a continual capability-growth system than as a static synthetic-data pipeline.

2. Multi-agent role-playing and trajectory synthesis

InfTool’s role-playing architecture has three principal agents: a User Simulator, a Tool-Calling Assistant, and an MCP Server. In multi-turn settings, the framework also inserts a Task Generation Agent that defines the scenario before the dialogue begins. The User Simulator produces natural-language requests and maintains persona-consistent dialogue behavior. The Tool-Calling Assistant is the tool-using policy being trained. The MCP Server exposes schemas, checks tool calls, and returns observations needed for continued interaction (Li et al., 29 Dec 2025).

Component	Function	Notes
User Simulator	Generates requests and multi-turn dialogue	Controls persona and user knowledge
Tool-Calling Assistant	Selects tools and arguments	Acts as the MCP client
MCP Server	Executes schema-level interaction	Validates calls and returns observations
Task Generation Agent	Defines multi-turn scenario metadata	Used for complex dialogues

Single-turn synthesis covers standard execution, parallel tool execution, and irrelevance detection. The last category is structurally important because a tool-using model must learn when not to call a tool. Multi-turn synthesis begins from scenario metadata spanning user profile, known versus unknown information, user need, and difficulty. The framework description specifies 3–5 rounds for core multi-turn generation, while the dataset analysis reports dialogues ranging from 2 to 15 turns, with average 7.78 and median 8 (Li et al., 29 Dec 2025).

The generated corpus is deliberately heterogeneous. It spans 15 primary domains plus a long tail of 35 additional categories. Of the MCP tools, 82.83% require parameter specifications. The multi-turn subset contains 4,965 dialogues, and 20.52% of those exceed 10 turns. About 53.7% of the dataset consists of single tool-call cases, while multi-turn dialogues consume about four times as many tokens on average as single-turn ones. This distribution indicates that InfTool is not restricted to one-shot API formatting; it explicitly targets long-horizon, stateful interaction regimes.

3. Tool-space construction, verification, and filtering

The framework begins from a large RapidAPI crawl. It starts with 17,713 candidates and refines them into 3,059 high-fidelity synthetic MCP tools through the “MCP Tree,” a hierarchical refinement and deduplication mechanism. Let $\mathcal{T}^{(r)}$ denote the tool set at iteration $r$, and let each tool in $\mathcal{T}^{(r)}$ have a semantic embedding. Candidate clusters are formed by grouping tools whose embeddings are semantically similar. An LLM then separates each cluster into redundant and unique subsets and synthesizes abstractions for redundant groups, producing the refined inventory $\mathcal{T}^{(r+1)}$ (Li et al., 29 Dec 2025).
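The clustering step can be illustrated with a greedy threshold rule over cosine similarity. This is only a sketch under stated assumptions: the actual MCP Tree procedure, its similarity threshold, and the embedding dimensionality are not specified here, and the tool names and 2-D vectors below are toy values.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cluster_tools(embeddings: dict[str, list[float]], threshold: float) -> list[list[str]]:
    """Greedily assign each tool to the first cluster whose seed is similar enough."""
    clusters: list[list[str]] = []
    for name, vec in embeddings.items():
        for members in clusters:
            seed = embeddings[members[0]]  # the first member acts as the cluster seed
            if cosine(vec, seed) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


# Toy 2-D embeddings (assumed values, for illustration only).
toy = {
    "get_weather": [1.0, 0.0],
    "fetch_forecast": [0.9, 0.1],
    "send_email": [0.0, 1.0],
}
clusters = cluster_tools(toy, threshold=0.9)
```

Each multi-member cluster would then be handed to an LLM to decide which tools are redundant and which abstraction should replace them; single-member clusters pass through unchanged.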

This stage is empirically consequential. Removing the MCP Tree drops overall BFCL performance from 70.9 to 55.21, indicating that tool deduplication and abstraction are not peripheral preprocessing but part of the learned capability boundary. This suggests that large-scale tool-use training depends not only on trajectory quality but also on the semantic topology of the tool inventory.

InfTool uses a two-layer quality-assurance pipeline. The first layer is in-trajectory self-reflection, intended to correct style violations, hallucinated tool requests, and schema errors before they propagate. The second layer is post-trajectory multi-agent validation through a voting-deliberation-correction process. The system also performs rejection analysis: a multi-agent voting mechanism pruned 82 multi-turn instances, roughly 1.2% of such cases, when the simulated user rejected incorrect tool use even without explicit hallucination (Li et al., 29 Dec 2025).
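The voting stage of the second layer can be sketched as a simple aggregation over validator verdicts. The verdict labels and the routing policy (unanimous accept, majority reject, otherwise deliberate) are assumptions made for illustration; the deliberation and correction steps themselves are not modeled.

```python
from collections import Counter


def vote(verdicts: list[str]) -> str:
    """Aggregate validator verdicts into 'accept', 'reject', or 'deliberate'."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    if counts["accept"] == len(verdicts):
        return "accept"  # unanimous approval keeps the trajectory
    if counts["reject"] > len(verdicts) / 2:
        return "reject"  # a strict majority of rejections prunes it
    return "deliberate"  # mixed verdicts are routed to deliberation and correction


def filter_trajectories(verdicts_by_id: dict[str, list[str]]) -> dict[str, list[str]]:
    """Group trajectory ids by the outcome of the vote."""
    outcome: dict[str, list[str]] = {"accept": [], "reject": [], "deliberate": []}
    for traj_id, verdicts in verdicts_by_id.items():
        outcome[vote(verdicts)].append(traj_id)
    return outcome
```

Under this assumed policy, only trajectories in the `deliberate` bucket would incur the cost of a correction pass.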

The asymmetry between single-turn and multi-turn generation is substantial. Reported correction success is 87.65% for single-turn data but only 12.06% for multi-turn data. This sharp disparity anchors one of the paper’s major cautions: synthesis quality degrades rapidly as dialogue horizon grows.

4. Closed-loop training, hard-example targeting, and reward design

InfTool’s optimization pipeline combines cold-start supervised fine-tuning with iterative reinforcement learning. The initial synthetic dataset is used for full-parameter supervised fine-tuning with global batch size 512, 3 epochs, BF16 precision, and a learning rate reported in the paper. The RL stage uses generation temperature 0.7, 1000 steps per iteration, and global batch size 1024. The main target model scales are Qwen2.5-7B and Qwen2.5-32B; supporting components include EmbeddingGemma for semantic clustering, Qwen3-Coder-30B-Instruct for tool refinement and data generation, and DeepSeek-R1 for chain-of-thought trajectory generation during cold start (Li et al., 29 Dec 2025).

The self-evolving step begins by evaluating the current policy $\pi$ and stratifying examples according to an execution consistency score $s(x)$. Hard examples are defined as

$\mathcal{D}_{\text{hard}} = \{x : s(x) < \tau\},$

with $\tau$ a threshold. These samples define the current learning frontier, and subsequent synthesis is targeted at them rather than sampled indiscriminately (Li et al., 29 Dec 2025).
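The selection of hard examples can be sketched as a threshold filter. The score used here, the fraction of sampled rollouts that match a reference tool call, is a stand-in chosen for illustration; the paper's exact definition of the execution consistency score is not reproduced, and the function names and threshold value are hypothetical.

```python
def consistency_score(rollouts: list[str], reference: str) -> float:
    """Fraction of sampled rollouts whose tool call matches the reference."""
    if not rollouts:
        return 0.0
    return sum(r == reference for r in rollouts) / len(rollouts)


def select_hard(examples: dict[str, tuple[list[str], str]], threshold: float) -> list[str]:
    """Return ids of examples whose score falls strictly below the threshold."""
    return [
        ex_id
        for ex_id, (rollouts, reference) in examples.items()
        if consistency_score(rollouts, reference) < threshold
    ]


# Each entry maps an example id to (sampled tool calls, reference tool call).
examples = {
    "q1": (["a", "a", "a", "a"], "a"),  # score 1.0
    "q2": (["a", "b", "b", "b"], "a"),  # score 0.25
    "q3": (["b", "b", "a", "a"], "a"),  # score 0.5
}
hard = select_hard(examples, threshold=0.5)
```

Only `q2` is selected here, so the next round of synthesis would target that example rather than resampling the whole pool.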

Policy optimization uses Group Relative Policy Optimization. The paper’s distinctive contribution at this stage is a gated composite reward. The reward combines a format component for structural validity, a tool component for execution fidelity, and a teacher component for reasoning quality. The reasoning reward is granted only if the tool-use reward exceeds a threshold. In operational terms, the system does not reward plausible reasoning unless that reasoning produces functionally correct tool behavior. This design constrains verbal fluency to execution-grounded utility and is intended to suppress verbose but non-executable trajectories.
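A minimal sketch of the gating idea, paired with the group-relative advantage that gives GRPO its name, is shown below. The unweighted sum, the default gate of 0.5, and the use of the population standard deviation are assumptions for illustration; the paper's actual weights and threshold are not reproduced here.

```python
import statistics


def gated_reward(fmt: float, tool: float, teacher: float, gate: float = 0.5) -> float:
    """Sum format and tool rewards; add the teacher reward only past the gate."""
    reward = fmt + tool
    if tool > gate:
        # Reasoning quality counts only when tool use is already good enough.
        reward += teacher
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Two rollouts with identical format and teacher scores but different tool scores.
low = gated_reward(fmt=1.0, tool=0.2, teacher=0.9)   # teacher reward withheld
high = gated_reward(fmt=1.0, tool=0.8, teacher=0.9)  # teacher reward granted
advantages = group_relative_advantages([low, high])
```

In this toy group the rollout with poor tool execution receives a negative advantage even though its reasoning score is identical, which is the behavior the gate is meant to produce.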

The resulting loop is structurally recursive: synthesize initial data, warm-start with SFT, roll out the current policy, identify hard cases, optimize with GRPO, synthesize new trajectories targeted to those hard cases, validate them, and repeat. A plausible implication is that InfTool’s main novelty lies less in any individual optimizer than in the coupling between diagnosis and regeneration.
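The control flow of this loop can be sketched with pluggable components. Everything here is a toy under stated assumptions: `train`, `synthesize`, and the string-based examples are hypothetical stand-ins, the SFT and GRPO stages are collapsed into a single `train` callable, and the validation step is omitted for brevity.

```python
from typing import Callable

Dataset = list[str]
Policy = Callable[[str], bool]  # returns True when the example is solved


def closed_loop(
    seed_data: Dataset,
    train: Callable[[Dataset], Policy],
    synthesize: Callable[[Dataset], Dataset],
    iterations: int,
) -> tuple[Dataset, list[int]]:
    """Alternate training, failure diagnosis, and targeted resynthesis.

    Returns the final dataset and the number of hard cases found per iteration.
    """
    data = list(seed_data)
    hard_counts: list[int] = []
    for _ in range(iterations):
        policy = train(data)                       # warm start or policy update
        hard = [x for x in data if not policy(x)]  # roll out and diagnose failures
        hard_counts.append(len(hard))
        if not hard:
            break                                  # no failures left at this frontier
        data.extend(synthesize(hard))              # regenerate data for the hard cases
    return data, hard_counts


def toy_train(data: Dataset) -> Policy:
    """Toy policy: solves variants, and originals once a variant has been seen."""
    seen = set(data)
    return lambda x: x.endswith("'") or (x + "'") in seen


def toy_synthesize(hard: Dataset) -> Dataset:
    """Toy generator: emits one targeted variant per hard example."""
    return [x + "'" for x in hard]


final_data, hard_counts = closed_loop(["a", "b"], toy_train, toy_synthesize, iterations=5)
```

The point of the sketch is the data dependency: what gets synthesized in each round is a function of the current policy's failures, not of a fixed sampling plan.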

5. Benchmark results, ablations, and generalization

The principal evaluation is on BFCL version 3. The authors report that BFCL v4 was not used because reproduction issues caused newly added items to yield zero values in their experiments. On this benchmark, InfTool-32B reaches 70.9 total score and is described as the top-performing open-source model in the table. The base 32B model improves from 19.8 to 70.9, a gain of 258.08%, and the SFT baseline improves further from 65.3 to 70.9 after iterative RL, a +5.6 gain (Li et al., 29 Dec 2025).
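The relative gain follows directly from the two reported totals:

$\frac{70.9 - 19.8}{19.8} = \frac{51.1}{19.8} \approx 2.5808,$

that is, an increase of about 258.08% over the base score. The RL contribution is the absolute difference $70.9 - 65.3 = 5.6$ points over the SFT baseline.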

Performance gains are especially large on complex tool-use regimes. In multi-turn scenarios, InfTool-32B scores 59.0 versus 32.0 for Qwen2.5-32B-Instruct, while in live parallel tool execution it scores 87.5 versus 50.0 for the same baseline. At smaller scale, InfTool-7B reaches 61.7 total BFCL score, surpassing a reported GPT-5.2 score of 60.4 in the authors’ table. The paper also reports a 93.3% improvement rate on “Complex Length” scenarios, although many remaining absolute errors still come from “Simple” tasks because the benchmark contains many of them (Li et al., 29 Dec 2025).

Ablations reinforce the importance of the framework’s main modules. Removing the MCP Tree reduces overall score from 70.9 to 55.21. Removing self-reflection causes broad degradation and collapses the irrelevance metric from 88.9 to 4.86, indicating that the model loses the ability to refrain from tool use when tools are unnecessary. The closed-loop RL stages also matter for sample efficiency: the paper notes that SFT improves steadily from 1k to 50k samples, but RL uses 5k samples per iteration and requires only about half the total data volume to reach similar performance (Li et al., 29 Dec 2025).

Out-of-distribution robustness is assessed with $\tau^2$-Bench. InfTool improves Retail from 40.0 to 67.0, Airline from 26.0 to 60.0, and Telecom from 21.0 to 67.5. Because $\tau^2$-Bench emphasizes state-dependent interaction and collaborative troubleshooting rather than one-shot API syntax, these gains indicate that the synthesized trajectories transmit more than formatting regularities. At the same time, the paper explicitly notes that it does not present a clean unseen-tool split isolating held-out APIs from seen APIs, so claims about unseen-tool generalization are supported indirectly rather than by a dedicated held-out-tool benchmark.

6. Relation to adjacent paradigms and stated limitations

InfTool belongs to a broader shift toward scalable tool-use learning, but it occupies a distinct design point. ToolGen turns tool retrieval into autoregressive generation by assigning each tool a unique token and eliminating a separate retrieval stage for tool identity (Wang et al., 2024). TInR similarly internalizes tool knowledge into model parameters through dedicated tool tokens and a bidirectional documentation-token alignment regime, with the explicit aim of avoiding external tool documentation at inference time (Xu et al., 12 Apr 2026). FinToolSyn, by contrast, focuses on finance-specific forward synthesis, dynamic retrieval, and balanced tool-call versus non-tool-call behavior over a repository of 43,066 tools and 148,984 dialogues (Huang et al., 25 Mar 2026). PruneTIR addresses a different layer of the stack: inference-time control of already tool-capable models through success-triggered pruning, stuck-triggered resampling, and retry-triggered tool suspension (Zhang et al., 11 May 2026).

Against these alternatives, InfTool’s distinctive combination is autonomous multi-agent synthesis from raw API specifications, iterative weakness-targeted data regeneration, verification by self-reflection plus consensus filtering, and execution-grounded GRPO. This suggests a division of emphasis across the literature: ToolGen and TInR prioritize parametric tool internalization, FinToolSyn prioritizes forward synthesis and realistic retrieval noise in a specialized domain, and PruneTIR prioritizes trajectory control at inference time, whereas InfTool treats data synthesis and policy improvement as a coupled closed-loop system.

The paper is also explicit about its limitations. Because the entire loop is synthetic and human-free, there may be a simulation-to-reality gap: the user simulator is likely more rational and structured than real users. Self-reflection degrades in very long contexts. The framework is restricted to text-based JSON-RPC interactions and does not handle multimodal tool use. Multi-turn synthesis remains much harder than single-turn synthesis, as reflected in the low 12.06% correction success for multi-turn data. Finally, although the architecture and results support claims of broad tool-use generalization, the absence of a dedicated seen-versus-unseen tool split means that such generalization should be described as indirect rather than conclusively isolated (Li et al., 29 Dec 2025).

In that sense, InfTool is most accurately described as an autonomous closed-loop framework for tool-use capability growth: it constructs a refined tool space, synthesizes structurally varied and verified trajectories, trains with execution-centered rewards, identifies hard cases from model behavior, and resynthesizes data at the frontier exposed by those failures. Its significance lies not only in synthetic-data scale, but in turning synthesis itself into an iterative component of tool-use learning.