Tool-Integrated Data Synthesis Pipeline

Updated 3 July 2025
  • Tool-integrated data synthesis pipeline is a systematic approach for generating and curating tool-use datasets to train large language models in multi-step reasoning.
  • It employs tool-integrated prompting, hint-based sampling, and rigorous quality normalization to ensure diverse and accurate tool-use trajectories.
  • The staged training framework, combining supervised fine-tuning and self-critic reinforcement learning, enables effective and collaborative multi-tool reasoning.

A tool-integrated data synthesis pipeline, as instantiated by the Tool-Star framework, is a systematized approach to generating, curating, and leveraging datasets that teach LLMs to reason through coordinated tool use. Such pipelines are critical for constructing multi-step, multi-tool reasoning benchmarks, particularly for LLMs that must autonomously invoke, compose, and utilize external tools within complex problem-solving trajectories.

1. Architecture of the Tool-Integrated Data Synthesis Pipeline

The pipeline in Tool-Star is structured to generate high-quality tool-use data at scale, explicitly addressing the scarcity and low diversity of available multi-tool collaborative reasoning corpora. It comprises three primary stages:

  1. Data Collection and Sampling: Generation of detailed tool-use trajectories using tool-integrated prompting and hint-based augmentation.
  2. Tool-Use Quality Normalization: Filtering and standardizing produced traces to ensure rationality and correct usage.
  3. Difficulty-Aware Classification: Systematic categorization of samples into curriculum stages, enabling staged training.

Each stage is designed to interface smoothly with downstream reinforcement learning (RL) processes for LLM policy optimization.

2. Tool-Integrated Prompting and Hint-Based Sampling

Tool-Integrated Prompting

This sampling method prompts LLMs to decide autonomously "when" and "how" to call external tools (e.g., Search, Python code execution). Prompts use explicit tokens such as <search>...</search> and <python>...</python>, and external scripts execute these tool calls, feeding results (in <result>...</result> blocks) back into the model’s context. The process continues iteratively—thought, tool invocation, result, next step—until an answer is obtained or resource limits are reached. Only samples resulting in a correct answer are retained.

The sampling formula (Equation 1):

$$P_\theta(\mathcal{R}^{c}, y \mid I^{\mathcal{T}}, q, \mathcal{T}) = \prod_{t=1}^{T_c} P_\theta (\mathcal{R}_t^c \mid \mathcal{R}^{c}_{<t}, I^{\mathcal{T}}, q, \{F^{\mathcal{T}}\}_{<t}) \cdot \prod_{t=1}^{T_y} P_\theta (y_t \mid y_{<t}, \mathcal{R}^c, I^{\mathcal{T}}, q)$$

where $q$ is the question, $I^{\mathcal{T}}$ the tool-use instruction, $\mathcal{T}$ the available tools, $\mathcal{R}^c$ the tool-integrated reasoning trajectory, $\{F^{\mathcal{T}}\}$ the tool results fed back into the context, and $y$ the final answer.
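The loop below is a minimal sketch of this interaction pattern, assuming `generate` is any callable that returns the model's next reasoning segment (ending after a closing tool tag or a final answer); the tool executors and tag set are illustrative placeholders rather than Tool-Star's actual interface.

```python
import re

# Matches a single tool call such as <search>query</search> or <python>code</python>.
TOOL_PATTERN = re.compile(r"<(search|python)>(.*?)</\1>", re.DOTALL)

def execute_tool(tool: str, payload: str) -> str:
    """Stand-in executors; a real pipeline would call a search backend or a
    sandboxed Python interpreter here."""
    if tool == "search":
        return f"[top documents for: {payload.strip()}]"
    return f"[stdout from running: {payload.strip()}]"

def tool_integrated_rollout(generate, question: str, max_steps: int = 8) -> str:
    """Alternate model reasoning and tool execution until the model stops
    calling tools (treated as the final answer) or the step budget runs out."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(context)                  # thought + optional tool call
        context += step
        match = TOOL_PATTERN.search(step)
        if match is None:                         # no tool call -> final answer
            break
        result = execute_tool(match.group(1), match.group(2))
        context += f"\n<result>{result}</result>\n"   # feed the result back
    return context
```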

Hint-Based Sampling

To encourage richer tool-use diversity, the pipeline introduces "hints" at points of uncertainty or natural verification steps in language-only reasoning traces. Hints may signal uncertainty (“not sure”), insert explicit tool-marker tokens, or request post-hoc answer validation. Upon hint injection, the LLM resumes reasoning with explicit tool use. Again, only correct completions are retained.

(Equation 2 details the process):

$$P_\theta(\mathcal{R}^{c}_{>t_H}, y \mid I^{\mathcal{T}}, q, \mathcal{R}^{c}_{\leq t_H}, \mathcal{T}) = \prod_{t = t_H}^{T_c} P_\theta (\mathcal{R}_t^c \mid \mathcal{R}^{c}_{\leq t}, I^{\mathcal{T}}, q, \{F^{\mathcal{T}}\}_{\leq t}) \cdot \prod_{t=1}^{T_y} P_\theta (y_t \mid y_{<t}, \mathcal{R}^c, I^{\mathcal{T}}, q)$$

where $t_H$ is the hint-injection timestep.
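As a rough illustration of hint injection, the sketch below truncates a language-only trace at an uncertainty marker and appends a tool-inviting hint before resampling; the hint wording, the marker, and the truncation rule are assumptions for illustration, not Tool-Star's exact prompts.

```python
# Hypothetical hint appended at the injection point t_H.
HINT = (
    "\nWait, I am not sure about this step. "
    "Let me verify it with a tool.\n<python>"
)

def inject_hint(language_only_trace: str, marker: str = "not sure") -> str:
    """Truncate a language-only trace at the first uncertainty marker (the
    hint-injection timestep t_H) and append the tool-inviting hint."""
    cut = language_only_trace.find(marker)
    prefix = language_only_trace if cut == -1 else language_only_trace[:cut]
    return prefix + HINT

def hint_based_sample(generate, question: str, language_only_trace: str) -> str:
    """Resume generation from t_H so the model continues the trace with
    explicit tool calls; only correct completions would be retained."""
    prompt = f"Question: {question}\n" + inject_hint(language_only_trace)
    return prompt + generate(prompt)
```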

Combining these strategies, the pipeline produces a broad and representative dataset ($D_\text{tool}^{v1}$).

3. Quality Normalization and Difficulty Classification

All generated samples undergo a normalization process:

  • Tool-call Frequency Control: Samples with excessive tool invocation are discarded (threshold $\beta$).
  • Duplicate Tool Call Removal: Repetitive, identical calls inside a single trace are filtered out.
  • Format Standardization: Consistent special token usage and paired start/end tags are strictly enforced.

This yields a high-quality dataset ($D_\text{tool}^{v2}$).
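A minimal sketch of these filters, assuming each trajectory is a single tagged string as in the sampling sketches above; the threshold name $\beta$ comes from the text, while the choice to drop whole traces containing duplicated calls is an illustrative simplification.

```python
import re
from collections import Counter

CALL_PATTERN = re.compile(r"<(search|python)>(.*?)</\1>", re.DOTALL)
TAGS = ("search", "python", "result")

def passes_normalization(trace: str, beta: int = 5) -> bool:
    """Return True if a trajectory survives all three normalization checks."""
    calls = CALL_PATTERN.findall(trace)

    # Tool-call frequency control: discard traces with excessive invocations.
    if len(calls) > beta:
        return False

    # Duplicate tool-call removal: flag identical repeated calls in one trace.
    if any(count > 1 for count in Counter(calls).values()):
        return False

    # Format standardization: every special tag must open and close in pairs.
    return all(trace.count(f"<{t}>") == trace.count(f"</{t}>") for t in TAGS)
```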

For curriculum-based training, a difficulty-aware classifier further assigns each sample to one of four categories (based on language-only and tool-integrated reasoning correctness):

  1. Both correct (tool not needed; SFT).
  2. Language-only correct, tool-use incorrect (rare; deprioritized).
  3. Language-only incorrect, tool-use correct (tool essential; SFT).
  4. Both incorrect (challenging cases; used for RL).

Separate subsets are used for supervised fine-tuning and RL, enabling a staged training strategy.
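The routing itself reduces to a small decision over the two correctness checks; the sketch below uses illustrative split labels for the SFT and RL subsets.

```python
def route_sample(language_only_correct: bool, tool_use_correct: bool) -> str:
    """Map the two correctness flags onto the four curriculum categories."""
    if language_only_correct and tool_use_correct:
        return "sft"    # 1. both correct: tool not strictly needed
    if language_only_correct and not tool_use_correct:
        return "drop"   # 2. rare case: deprioritized
    if tool_use_correct:
        return "sft"    # 3. tool essential: supervised fine-tuning
    return "rl"         # 4. both incorrect: challenging, reserved for RL
```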

4. Two-Stage Training Framework

Stage 1: Cold-Start Supervised Fine-Tuning

The model ($\hat{\pi}_\theta$) is first trained on "easy" cases via standard maximum-likelihood loss:

$$\mathcal{L}(\theta) = -\sum_{(x_i, y_i) \in D_\text{tool}^{SFT}} \log P_\theta(y_i \mid x_i)$$

where $x_i$ includes the context and input, while $y_i$ is the correct tool-augmented trajectory.
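In code, this is an ordinary token-level cross-entropy; the sketch below assumes a HuggingFace-style causal LM whose forward pass returns `.logits`, and labels in which the prompt tokens $x_i$ are masked with -100 so only the tool-augmented trajectory $y_i$ contributes.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the target trajectory given its context."""
    logits = model(input_ids).logits               # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :].contiguous()  # predict token t from tokens < t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,                         # skip masked prompt tokens
    )
```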

Stage 2: Multi-Tool Self-Critic RL

The RL stage refines multi-tool collaboration with a memory-based rollout approach and hierarchical reward function. The reward structure incentivizes:

  • Correct answer production.
  • Proper tool-invocation formatting.
  • Bonus ($r_M = 0.1$) for correctly using multiple tools within a reasoning trace (see the code sketch below):

$$R = \begin{cases} \max(\text{Acc.} + r_M, \text{Acc.}) & \text{if format OK, Acc.} > 0 \\ 0 & \text{if format OK, Acc.} = 0 \\ -1 & \text{otherwise} \end{cases}$$
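A minimal sketch of this case split, where `accuracy`, `format_ok`, and `used_multiple_tools` are illustrative names for the answer score, the tag-pairing check, and the multi-tool bonus condition:

```python
R_M = 0.1  # multi-tool bonus from the reward definition above

def hierarchical_reward(accuracy: float, format_ok: bool,
                        used_multiple_tools: bool) -> float:
    if not format_ok:
        return -1.0                               # malformed tool invocation
    if accuracy <= 0.0:
        return 0.0                                # well-formed but incorrect
    bonus = R_M if used_multiple_tools else 0.0
    return max(accuracy + bonus, accuracy)        # reward correctness + composition
```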

RL updates use Group Relative Policy Optimization (GRPO) and are followed by a Self-Critic Direct Preference Optimization (DPO) phase, where the model is tasked with ranking and learning from its own outputs, adjusting behavior with respect to the hierarchical reward function:

$$\mathcal{L}_\text{DPO}(\pi_\theta;\pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)} \bigg[\log \sigma \Big(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\Big)\bigg]$$
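The sketch below computes this objective from precomputed sequence log-probabilities of the preferred ($y_w$) and dispreferred ($y_l$) outputs under the current policy and the frozen reference policy; the tensor interface and the value of $\beta$ are assumptions.

```python
import torch.nn.functional as F
from torch import Tensor

def dpo_loss(policy_logp_w: Tensor, policy_logp_l: Tensor,
             ref_logp_w: Tensor, ref_logp_l: Tensor, beta: float = 0.1) -> Tensor:
    """-log sigmoid(beta * [(policy log-ratio margin) - (reference margin)])."""
    policy_margin = policy_logp_w - policy_logp_l
    ref_margin = ref_logp_w - ref_logp_l
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```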

5. Empirical Significance: Collaborative Multi-Tool Reasoning

Tool-Star’s pipeline creates a scalable path to robust multi-tool reasoning by:

  • Generating diverse tool-use demonstrations.
  • Filtering for sample quality and format.
  • Progressively exposing the model, via curriculum learning, to cases of increasing difficulty in which tool use becomes increasingly necessary.
  • Reinforcing collaborative behavior through an explicit reward structure that values both correctness and tool composition.

Experimental evaluation over ten challenging benchmarks confirms that this approach yields significant improvements in both the effectiveness and the efficiency of multi-tool collaborative LLMs, demonstrating that such a pipeline is central to advancing the state of tool-integrated reasoning.

6. Tabular Summary of Key Pipeline Components

| Step | Methodology | Purpose |
| --- | --- | --- |
| Data Sampling | Tool-integrated prompts, hint-based augmentation | Diversity of tool-use trajectories |
| Quality Normalization | Frequency/duplicate filtering, format regularization | Ensure rational and correct tool usage |
| Difficulty Classification | Language-only vs. tool-integrated reasoning accuracy | Curriculum data partitioning |
| Cold-Start Supervised FT | SFT on easy/medium cases | Teach basic syntax and tool invocation |
| Multi-Tool RL (GRPO + DPO) | Self-critic RL with hierarchical rewards | Efficient, collaborative, reward-aligned tool use |

7. Concluding Perspective

The tool-integrated data synthesis pipeline as implemented in Tool-Star demonstrates a structured, reproducible, and modular approach to equipping LLMs with advanced, multi-step, collaborative tool-use capabilities. By combining tool-integrated prompting, hint-based augmentation, rigorous quality normalization, difficulty-aware sample classification, and staged RL-aligned training, the pipeline enables the emergence of transparent, effective, and generalizable LLM agents for real-world tool reasoning tasks.