Tool-Integrated Data Synthesis Pipeline

Updated 3 July 2025
  • Tool-integrated data synthesis pipeline is a systematic approach for generating and curating tool-use datasets to train large language models in multi-step reasoning.
  • It employs tool-integrated prompting, hint-based sampling, and rigorous quality normalization to ensure diverse and accurate tool-use trajectories.
  • The staged training framework, combining supervised fine-tuning and self-critic reinforcement learning, enables effective and collaborative multi-tool reasoning.

A tool-integrated data synthesis pipeline, as instantiated by the Tool-Star framework, is a systematized approach to generating, curating, and leveraging datasets that teach LLMs to reason through coordinated tool use. Such pipelines are critical for constructing multi-step, multi-tool reasoning benchmarks, particularly for LLMs that must autonomously invoke, compose, and utilize external tools within complex problem-solving trajectories.

1. Architecture of the Tool-Integrated Data Synthesis Pipeline

The pipeline in Tool-Star is structured to generate high-quality tool-use data at scale, explicitly addressing the scarcity and low diversity of available multi-tool collaborative reasoning corpora. It comprises three primary stages:

  1. Data Collection and Sampling: Generation of detailed tool-use trajectories using tool-integrated prompting and hint-based augmentation.
  2. Tool-Use Quality Normalization: Filtering and standardizing produced traces to ensure rationality and correct usage.
  3. Difficulty-Aware Classification: Systematic categorization of samples into curriculum stages, enabling staged training.

Each stage is designed to interface smoothly with downstream reinforcement learning (RL) processes for LLM policy optimization.

2. Tool-Integrated Prompting and Hint-Based Sampling

Tool-Integrated Prompting

This sampling method prompts LLMs to decide autonomously "when" and "how" to call external tools (e.g., Search, Python code execution). Prompts use explicit tokens such as <search>...</search> and <python>...</python>, and external scripts execute these tool calls, feeding results (in <result>...</result> blocks) back into the model’s context. The process continues iteratively—thought, tool invocation, result, next step—until an answer is obtained or resource limits are reached. Only samples resulting in a correct answer are retained.

The sampling formula (Equation 1):

$$P_\theta(\mathcal{R}^{c}, y \mid I^{\mathcal{T}}, q, \mathcal{T}) = \prod_{t=1}^{T_c} P_\theta (\mathcal{R}_t^c \mid \mathcal{R}^{c}_{<t}, I^{\mathcal{T}}, q, \{F^{\mathcal{T}}\}_{<t}) \cdot \prod_{t=1}^{T_y} P_\theta (y_t \mid y_{<t}, \mathcal{R}^c, I^{\mathcal{T}}, q)$$

where $q$ is the question, $I^{\mathcal{T}}$ the tool-use instruction, $\mathcal{T}$ the available tools, $\mathcal{R}^c$ the tool-integrated reasoning trajectory, $\{F^{\mathcal{T}}\}$ the tool results fed back into the context, and $y$ the final answer.
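The loop below is a minimal sketch of this interaction pattern, assuming `generate` is any callable that returns the model's next reasoning segment (ending after a closing tool tag or a final answer); the tool executors and tag set are illustrative placeholders rather than Tool-Star's actual interface.

```python
import re

# Matches a single tool call such as <search>query</search> or <python>code</python>.
TOOL_PATTERN = re.compile(r"<(search|python)>(.*?)</\1>", re.DOTALL)

def execute_tool(tool: str, payload: str) -> str:
    """Stand-in executors; a real pipeline would call a search backend or a
    sandboxed Python interpreter here."""
    if tool == "search":
        return f"[top documents for: {payload.strip()}]"
    return f"[stdout from running: {payload.strip()}]"

def tool_integrated_rollout(generate, question: str, max_steps: int = 8) -> str:
    """Alternate model reasoning and tool execution until the model stops
    calling tools (treated as the final answer) or the step budget runs out."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(context)                  # thought + optional tool call
        context += step
        match = TOOL_PATTERN.search(step)
        if match is None:                         # no tool call -> final answer
            break
        result = execute_tool(match.group(1), match.group(2))
        context += f"\n<result>{result}</result>\n"   # feed the result back
    return context
```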

Hint-Based Sampling

To encourage richer tool-use diversity, the pipeline introduces "hints" at points of uncertainty or natural verification steps in language-only reasoning traces. Hints may signal uncertainty (“not sure”), insert explicit tool-marker tokens, or request post-hoc answer validation. Upon hint injection, the LLM resumes reasoning with explicit tool use. Again, only correct completions are retained.

(Equation 2 details the process):

$$P_\theta(\mathcal{R}^{c}_{>t_H}, y \mid I^{\mathcal{T}}, q, \mathcal{R}^{c}_{\leq t_H}, \mathcal{T}) = \prod_{t = t_H}^{T_c} P_\theta (\mathcal{R}_t^c \mid \mathcal{R}^{c}_{\leq t}, I^{\mathcal{T}}, q, \{F^{\mathcal{T}}\}_{\leq t}) \cdot \prod_{t=1}^{T_y} P_\theta (y_t \mid y_{<t}, \mathcal{R}^c, I^{\mathcal{T}}, q)$$

where $t_H$ is the hint-injection timestep.
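As a rough illustration of hint injection, the sketch below truncates a language-only trace at an uncertainty marker and appends a tool-inviting hint before resampling; the hint wording, the marker, and the truncation rule are assumptions for illustration, not Tool-Star's exact prompts.

```python
# Hypothetical hint appended at the injection point t_H.
HINT = (
    "\nWait, I am not sure about this step. "
    "Let me verify it with a tool.\n<python>"
)

def inject_hint(language_only_trace: str, marker: str = "not sure") -> str:
    """Truncate a language-only trace at the first uncertainty marker (the
    hint-injection timestep t_H) and append the tool-inviting hint."""
    cut = language_only_trace.find(marker)
    prefix = language_only_trace if cut == -1 else language_only_trace[:cut]
    return prefix + HINT

def hint_based_sample(generate, question: str, language_only_trace: str) -> str:
    """Resume generation from t_H so the model continues the trace with
    explicit tool calls; only correct completions would be retained."""
    prompt = f"Question: {question}\n" + inject_hint(language_only_trace)
    return prompt + generate(prompt)
```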

Combining these strategies, the pipeline produces a broad and representative dataset ($D_\text{tool}^{v1}$).

3. Quality Normalization and Difficulty Classification

All generated samples undergo a normalization process:

  • Tool-call Frequency Control: Samples with excessive tool invocation are discarded (threshold $\beta$).
  • Duplicate Tool Call Removal: Repetitive, identical calls inside a single trace are filtered out.
  • Format Standardization: Consistent special token usage and paired start/end tags are strictly enforced.

This yields a high-quality dataset ($D_\text{tool}^{v2}$).
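A minimal sketch of these filters, assuming each trajectory is a single tagged string as in the sampling sketches above; the threshold name $\beta$ comes from the text, while the choice to drop whole traces containing duplicated calls is an illustrative simplification.

```python
import re
from collections import Counter

CALL_PATTERN = re.compile(r"<(search|python)>(.*?)</\1>", re.DOTALL)
TAGS = ("search", "python", "result")

def passes_normalization(trace: str, beta: int = 5) -> bool:
    """Return True if a trajectory survives all three normalization checks."""
    calls = CALL_PATTERN.findall(trace)

    # Tool-call frequency control: discard traces with excessive invocations.
    if len(calls) > beta:
        return False

    # Duplicate tool-call removal: flag identical repeated calls in one trace.
    if any(count > 1 for count in Counter(calls).values()):
        return False

    # Format standardization: every special tag must open and close in pairs.
    return all(trace.count(f"<{t}>") == trace.count(f"</{t}>") for t in TAGS)
```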

For curriculum-based training, a difficulty-aware classifier further assigns each sample to one of four categories (based on language-only and tool-integrated reasoning correctness):

  1. Both correct (tool not needed; SFT).
  2. Language-only correct, tool-use incorrect (rare; deprioritized).
  3. Language-only incorrect, tool-use correct (tool essential; SFT).
  4. Both incorrect (challenging cases; used for RL).

Separate subsets are used for supervised fine-tuning and RL, enabling a staged training strategy.
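The routing itself reduces to a small decision over the two correctness checks; the sketch below uses illustrative split labels for the SFT and RL subsets.

```python
def route_sample(language_only_correct: bool, tool_use_correct: bool) -> str:
    """Map the two correctness flags onto the four curriculum categories."""
    if language_only_correct and tool_use_correct:
        return "sft"    # 1. both correct: tool not strictly needed
    if language_only_correct and not tool_use_correct:
        return "drop"   # 2. rare case: deprioritized
    if tool_use_correct:
        return "sft"    # 3. tool essential: supervised fine-tuning
    return "rl"         # 4. both incorrect: challenging, reserved for RL
```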

4. Two-Stage Training Framework

Stage 1: Cold-Start Supervised Fine-Tuning

The model ($\hat{\pi}_\theta$) is first trained on "easy" cases via standard maximum-likelihood loss:

$$\mathcal{L}(\theta) = -\sum_{(x_i, y_i) \in D_\text{tool}^{SFT}} \log P_\theta(y_i \mid x_i)$$

where $x_i$ includes the context and input, while $y_i$ is the correct tool-augmented trajectory.
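In code, this is an ordinary token-level cross-entropy; the sketch below assumes a HuggingFace-style causal LM whose forward pass returns `.logits`, and labels in which the prompt tokens $x_i$ are masked with -100 so only the tool-augmented trajectory $y_i$ contributes.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the target trajectory given its context."""
    logits = model(input_ids).logits               # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :].contiguous()  # predict token t from tokens < t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,                         # skip masked prompt tokens
    )
```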

Stage 2: Multi-Tool Self-Critic RL

The RL stage refines multi-tool collaboration with a memory-based rollout approach and hierarchical reward function. The reward structure incentivizes:

  • Correct answer production.
  • Proper tool-invocation formatting.
  • Bonus ($r_M = 0.1$) for correctly using multiple tools within a reasoning trace (see the code sketch below):

$$R = \begin{cases} \max(\text{Acc.} + r_M, \text{Acc.}) & \text{if format OK, Acc.} > 0 \\ 0 & \text{if format OK, Acc.} = 0 \\ -1 & \text{otherwise} \end{cases}$$
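A minimal sketch of this case split, where `accuracy`, `format_ok`, and `used_multiple_tools` are illustrative names for the answer score, the tag-pairing check, and the multi-tool bonus condition:

```python
R_M = 0.1  # multi-tool bonus from the reward definition above

def hierarchical_reward(accuracy: float, format_ok: bool,
                        used_multiple_tools: bool) -> float:
    if not format_ok:
        return -1.0                               # malformed tool invocation
    if accuracy <= 0.0:
        return 0.0                                # well-formed but incorrect
    bonus = R_M if used_multiple_tools else 0.0
    return max(accuracy + bonus, accuracy)        # reward correctness + composition
```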

RL updates use Group Relative Policy Optimization (GRPO) and are followed by a Self-Critic Direct Preference Optimization (DPO) phase, where the model is tasked with ranking and learning from its own outputs, adjusting behavior with respect to the hierarchical reward function:

$$\mathcal{L}_\text{DPO}(\pi_\theta;\pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)} \bigg[\log \sigma \Big(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\Big)\bigg]$$
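The sketch below computes this objective from precomputed sequence log-probabilities of the preferred ($y_w$) and dispreferred ($y_l$) outputs under the current policy and the frozen reference policy; the tensor interface and the value of $\beta$ are assumptions.

```python
import torch.nn.functional as F
from torch import Tensor

def dpo_loss(policy_logp_w: Tensor, policy_logp_l: Tensor,
             ref_logp_w: Tensor, ref_logp_l: Tensor, beta: float = 0.1) -> Tensor:
    """-log sigmoid(beta * [(policy log-ratio margin) - (reference margin)])."""
    policy_margin = policy_logp_w - policy_logp_l
    ref_margin = ref_logp_w - ref_logp_l
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```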

5. Empirical Significance: Collaborative Multi-Tool Reasoning

Tool-Star’s pipeline creates a scalable path to robust multi-tool reasoning by:

  • Generating diverse tool-use demonstrations.
  • Filtering for sample quality and format.
  • Progressively exposing the model, via curriculum learning, to cases of increasing difficulty in which tool use becomes increasingly necessary.
  • Reinforcing collaborative behavior through an explicit reward structure that values both correctness and tool composition.

Experimental evaluation over ten challenging benchmarks confirms that this approach yields significant improvements in both the effectiveness and the efficiency of multi-tool collaborative LLMs, demonstrating that such a pipeline is central to advancing the state of tool-integrated reasoning.

6. Tabular Summary of Key Pipeline Components

| Step | Methodology | Purpose |
| --- | --- | --- |
| Data Sampling | Tool-integrated prompts, hint-based augmentation | Diversity of tool-use trajectories |
| Quality Normalization | Frequency/duplicate filtering, format regularization | Ensure rational and correct tool usage |
| Difficulty Classification | Language-only vs. tool-integrated reasoning accuracy | Curriculum data partitioning |
| Cold-Start Supervised FT | SFT on easy/medium cases | Teach basic syntax and tool invocation |
| Multi-Tool RL (GRPO + DPO) | Self-critic RL with hierarchical rewards | Efficient, collaborative, reward-aligned tool use |

7. Concluding Perspective

The tool-integrated data synthesis pipeline as implemented in Tool-Star demonstrates a structured, reproducible, and modular approach to equipping LLMs with advanced, multi-step, collaborative tool-use capabilities. By combining tool-integrated prompting, hint-based augmentation, rigorous quality normalization, difficulty-aware sample classification, and staged RL-aligned training, the pipeline enables the emergence of transparent, effective, and generalizable LLM agents for real-world tool reasoning tasks.