TIRGen Data Construction Pipeline
- TIRGen is a synthetic data construction pipeline that employs a multi-agent actor-critic model to generate high-quality, tool-integrated reasoning data.
- It systematically interleaves natural language explanations with executable code, refining reasoning trajectories through iterative execution and self-correction.
- The pipeline underpins hierarchical reinforcement learning in THOR, demonstrating improved accuracy on benchmark mathematical reasoning and code tasks.
The TIRGen Data Construction Pipeline is a synthetic data construction framework implemented in the context of THOR (Tool-Integrated Hierarchical Optimization via RL) to create high-quality datasets for training and evaluating tool-integrated mathematical reasoning systems. TIRGen adopts a multi-agent actor-critic model to systematically compose reasoning trajectories that interleave natural language explanation and executable code, yielding sequences that enable LLMs to learn fine-grained reasoning in tandem with tool use. This structured synthesis pipeline forms the foundational data resource for hierarchical RL optimization and for self-correcting inference in tool-augmented mathematical reasoning.
1. Multi-Agent Actor-Critic Framework for Data Generation
TIRGen orchestrates two collaborative agents: the Actor and the Critic.
- The Actor agent generates natural language reasoning steps ("think" phase) that explicate the solution approach for a mathematical problem.
- The Critic agent analyzes each reasoning step to detect sub-parts that can be more reliably or efficiently solved via external code execution, such as explicit numeric calculations or algebraic manipulation.
When the Critic identifies a “code-solvable” segment, it extracts the logical portion from the natural language and produces a corresponding code snippet (typically Python). This code is then executed in a sandbox environment, and the observation (result) is incorporated back into the reasoning step. The revised step now consists of both a human-readable explanation and a validated result from external computation. This loop continues, alternating between free-form explanation and tool calls, iteratively constructing a reasoning trajectory that is both human-intelligible and precision-aligned.
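The actor-critic loop described above can be sketched in a few lines. Everything below is a hypothetical stand-in: `actor_step`, `critic_extract`, and `sandbox_execute` are illustrative stubs, not the authors' implementation.

```python
# Minimal sketch of the TIRGen actor-critic generation loop.
# The Actor, Critic, and sandbox here are hypothetical stand-ins.

def actor_step(problem: str, history: list) -> str:
    """Hypothetical Actor: emit the next natural-language reasoning step."""
    return f"Step {len(history) + 1}: compute 3 * 7 for '{problem}'"

def critic_extract(step: str):
    """Hypothetical Critic: return a code snippet if the step is code-solvable."""
    if "compute" in step:
        return "result = 3 * 7"
    return None

def sandbox_execute(code: str):
    """Execute the snippet in an isolated namespace and return `result`."""
    ns = {}
    exec(code, {}, ns)
    return ns.get("result")

def build_trajectory(problem: str, max_steps: int = 1) -> list:
    """Alternate Actor steps with Critic-triggered tool calls."""
    history = []
    for _ in range(max_steps):
        step = actor_step(problem, history)
        code = critic_extract(step)
        if code is not None:
            obs = sandbox_execute(code)
            # Fold the validated observation back into the reasoning step.
            step = f"{step}\nCode: {code}\nObservation: {obs}"
        history.append(step)
    return history
```

The key design point is that the Critic, not the Actor, decides when a step should be delegated to the sandbox, so tool calls are targeted at code-solvable segments rather than emitted indiscriminately.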
2. Stepwise Tool-Integrated Trajectory Synthesis
The core mechanism of TIRGen’s pipeline is a granular, stepwise construction process:
a. Actor step generation: For a given problem, the Actor emits a reasoning step constrained to a maximum length.
b. Critic analysis and extraction: The Critic evaluates the step for segments amenable to code execution. If any are present, it extracts the code-solvable logic and emits a corresponding code snippet.
c. Code execution and feedback: The snippet is executed within a code interpreter sandbox, and the output is used to refine or replace the initial reasoning step.
d. Trajectory construction: The refined explanation and code result are appended to the current solution path, and the loop repeats for subsequent steps.
e. Completion and filtering: Once the reasoning is complete, TIRGen applies a multi-stage filtering protocol to the trajectory, enforcing:
- Format consistency (e.g., code formatting, correct answer tagging),
- Code quality (ensuring the snippet is non-trivial, executes without error, and involves substantive operations),
- Diversity (balancing tasks by difficulty and the number of tool calls).
This process produces a robust, high-quality cold start dataset used for training.
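The multi-stage filtering can be illustrated with simple predicates. The concrete checks and thresholds below (fenced-code detection, an `<answer>` tag, a cap on tool calls) are assumptions made for the sketch, not the paper's exact criteria.

```python
# Hedged sketch of TIRGen-style trajectory filtering.
# Predicates and thresholds are illustrative assumptions.
import re

def format_ok(traj: str) -> bool:
    """Format consistency: require a fenced code block and a tagged answer."""
    return "```python" in traj and "<answer>" in traj

def code_ok(traj: str) -> bool:
    """Code quality: snippets must be non-trivial and error-free."""
    snippets = re.findall(r"```python\n(.*?)```", traj, re.S)
    if not snippets:
        return False
    return all(("=" in s or "(" in s) and "Error" not in s for s in snippets)

def keep(traj: str, n_tool_calls: int, max_calls: int = 8) -> bool:
    """Diversity control sketched as a simple cap on tool calls."""
    return format_ok(traj) and code_ok(traj) and n_tool_calls <= max_calls
```

In practice the diversity stage would balance the whole dataset by difficulty and tool-call counts rather than filter trajectories one at a time; a per-trajectory cap is the simplest stand-in.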
3. Hierarchical Reinforcement Learning Optimization
THOR employs a dual-level RL optimization protocol, informed by the structure of TIRGen trajectories:
- Trajectory-level RL: The policy is optimized against a reward defined by the correctness of the overall reasoning chain. The likelihood of a trajectory is modeled as

  $$\pi(\tau \mid q) = \prod_{i=1}^{n} \pi(s_i \mid q, h_{i-1})\,\pi(a_i \mid q, h_{i-1}, s_i),$$

  where $s_i$ are reasoning steps, $a_i$ are tool actions, and $h_{i-1}$ encodes the history.
- Step-level RL: For steps where the tool action fails, the model backtracks to the associated reasoning prefix, regenerates the reasoning suffix and code, and uses the result of execution as a direct reward signal. The two objectives are summed,

  $$\mathcal{L} = \mathcal{L}_{\text{trajectory}} + \mathcal{L}_{\text{step}}.$$
Step-level correction is triggered specifically by failed tool executions, targeting fine-grained error remediation.
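One way the two reward terms might be combined is sketched below, assuming a binary final-answer reward, a ±1 execution reward, and an illustrative weight `w_step`; none of these values are taken from the paper.

```python
# Illustrative sketch of a dual reward: a trajectory-level term for
# final-answer correctness plus step-level terms from tool execution.
# The reward shaping and the weight are assumptions, not the paper's values.

def trajectory_reward(final_answer: str, gold: str) -> float:
    """Binary reward for overall chain correctness."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0

def step_reward(execution_ok: bool) -> float:
    """Direct signal from the sandbox: did the regenerated code run?"""
    return 1.0 if execution_ok else -1.0

def total_reward(final_answer: str, gold: str,
                 step_outcomes: list, w_step: float = 0.5) -> float:
    """Sum the trajectory-level and (weighted) step-level objectives."""
    r_traj = trajectory_reward(final_answer, gold)
    r_step = sum(step_reward(ok) for ok in step_outcomes)
    return r_traj + w_step * r_step
```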
4. External Tool Integration
Tool integration in TIRGen is realized through an executable sandbox (e.g., a Python interpreter with access to libraries such as sympy and numpy, as well as standard control-flow constructs). The Critic’s code extraction module is responsible for converting reasoning logic into code, which is executed to provide precise numerical or symbolic outcomes. The returned results are used to augment or correct free-form explanations, ensuring that the data construction protocol systematically blends human-like abstraction with computational precision.
5. Self-Correction and Data Quality Assurance
During inference—and implicitly during data synthesis—TIRGen implements a self-correction mechanism. If code execution fails at any reasoning step, the trajectory is automatically rolled back to a predefined prefix. The problematic reasoning suffix and its associated code action are regenerated, and tool execution is retried. This process is repeated up to a fixed number of attempts (e.g., 4), enabling robust recovery from intermediate tool failures and ensuring overall data quality and validity.
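The rollback-and-retry loop can be sketched as follows; `regenerate` is a hypothetical stand-in for the model, contrived here to fail twice before succeeding.

```python
# Sketch of the rollback-and-retry self-correction loop: on tool failure,
# regenerate the code action from the last good prefix, up to a fixed
# number of attempts (4 here, matching the example in the text).

def regenerate(prefix: str, attempt: int) -> str:
    """Hypothetical stand-in for the model regenerating a code action."""
    return "result = 1/0" if attempt < 2 else "result = 6 * 7"

def execute(code: str):
    """Run the snippet in an isolated namespace and return `result`."""
    ns = {}
    exec(code, {}, ns)
    return ns["result"]

def self_correct(prefix: str, max_attempts: int = 4):
    """Retry regeneration and execution until success or attempts exhausted."""
    for attempt in range(max_attempts):
        code = regenerate(prefix, attempt)
        try:
            return execute(code)   # success: keep this step
        except Exception:
            continue               # roll back to the prefix and retry
    return None                    # give up after max_attempts
```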
6. Empirical Impact and Generalization
The TIRGen Data Construction Pipeline, as operationalized within THOR, has demonstrated strong generalization and superior empirical performance on mathematical reasoning benchmarks (MATH 500, AIME, AMC, Minerva Math, OlympiadBench) and code tasks (HumanEval, MBPP, LiveCodeBench). Models trained on TIRGen-constructed datasets exhibit improved accuracy on both trajectory-completion and code-generation tasks relative to comparable baselines. This empirical outcome substantiates the integration of tool-based feedback during both data construction and model optimization.
7. Availability and Prospects
All code for TIRGen, the resulting datasets, and the complete THOR framework are publicly available at https://github.com/JingMog/THOR. This supports immediate uptake for research into tool-augmented reasoning and the design of hierarchical RL strategies. A plausible implication is that similar multi-agent construction strategies could generalize to other domains that require fine-grained reasoning augmented by external computation, forming a new standard for data construction in tool-integrated AI research (Chang et al., 17 Sep 2025).