Multi-Turn Tool-Integrated Reasoning (TIR)

Updated 22 May 2026

Multi-Turn TIR is a framework where language models alternate between natural language reasoning and tool invocations to adapt their responses based on observed outputs.
It leverages reinforcement learning techniques, such as PPO variants and dual-level advantage estimation, to optimize sequential decision-making and reward assignment.
The paradigm enhances dynamic error correction, efficient tool selection, and integration across diverse applications like SQL generation and visual reasoning.

Multi-Turn Tool-Integrated Reasoning (TIR) is a paradigm in which large language models (LLMs) alternately produce natural language reasoning, invoke external tools (such as code interpreters or search engines), and adapt their subsequent reasoning based on observed tool outputs. Unlike single-turn reasoning, multi-turn TIR formalizes a sequential, adaptive interaction loop between the LLM and its tools, supporting long-horizon problem solving, dynamic error correction, and compositional, context-sensitive tool use across turns. This article surveys the formal framework, algorithmic foundations, behavioral features, and empirical outcomes of multi-turn TIR, with a focus on state-of-the-art reinforcement learning and hybrid approaches.

1. Formal Framework and Problem Definition

In multi-turn TIR, the agent (typically an LLM) is formalized as a policy $\pi_\theta$ that maps the current state to the next action. The state includes the user question $Q$ and the complete history of previous reasoning, tool calls, and tool outputs; the action may comprise one or more of: a segment of natural language reasoning, a tool invocation with specified arguments, or a final answer declaration (Wei et al., 29 Jul 2025, Guo et al., 10 Apr 2026, Lu et al., 24 Nov 2025). At each turn $t$, the state $s_t$ encodes $Q$ and the full dialogue and tool-use context so far; the action $a_t$ is typically an interleaved sequence with special control segments (e.g., <think>...</think>, <search>...</search>, <code>...</code>).

A general TIR trajectory can be denoted as:

$(r_1, a_1, o_1),\; (r_2, a_2, o_2),\; \ldots,\; (r_T, a_T, o_T)$

where $r_t$ is the LLM's reasoning, $a_t$ encodes optional tool calls, and $o_t$ is the observed tool result (or $\varnothing$ if no tool is used at that step) (Chen et al., 11 Jan 2026, Huang et al., 23 Jun 2025, Zhang et al., 1 Feb 2026).
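The interaction loop behind such a trajectory can be sketched in a few lines. This is a minimal illustration, not any cited system's implementation: the `Turn` record, the policy signature, and the toy adder tool are all hypothetical names introduced here, and `None` stands in for the empty observation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    reasoning: str              # r_t: natural language reasoning
    tool_call: Optional[str]    # a_t: tool call payload, or None if no tool is used
    observation: Optional[str]  # o_t: tool result, or None for the empty observation


# Hypothetical policy signature (an assumption for this sketch):
# (question, history) -> (reasoning, tool_call, final_answer)
Policy = Callable[[str, List[Turn]], Tuple[str, Optional[str], Optional[str]]]


def rollout(question: str, policy: Policy, tool: Callable[[str], str],
            max_turns: int = 8) -> Tuple[List[Turn], Optional[str]]:
    """Alternate between policy steps and tool execution until an answer appears."""
    history: List[Turn] = []
    for _ in range(max_turns):
        reasoning, tool_call, answer = policy(question, history)
        observation = tool(tool_call) if tool_call is not None else None
        history.append(Turn(reasoning, tool_call, observation))
        if answer is not None:
            return history, answer
    return history, None  # turn budget exhausted without a final answer


def toy_policy(question: str, history: List[Turn]):
    """Stub policy: call the tool once, then answer with what it returned."""
    if not history:
        return "I should add the two numbers with the tool.", "17+25", None
    return "The tool returned the sum.", None, history[-1].observation


def add_tool(expression: str) -> str:
    """Toy tool: sums integers separated by '+'."""
    return str(sum(int(part) for part in expression.split("+")))


trajectory, final_answer = rollout("What is 17 + 25?", toy_policy, add_tool)
```

In a real system the policy would be an LLM call and the tool a sandboxed interpreter or search backend; the loop structure is the part this sketch is meant to show.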

Multimodal and database-integrated versions further extend the state/action space to include image inputs, programmatic tool APIs, or SQL execution environments (Lu et al., 24 Nov 2025, Xu et al., 29 Oct 2025).

The overarching learning objective is to optimize the expected cumulative reward over multi-turn trajectories according to a task-specific criterion, such as answer correctness, reasoning trace validity, and/or tool-use quality.
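In standard RL notation, this objective can be written as follows. The symbols $J$, $\tau$, and $R$ are introduced here for illustration and are not taken from the cited papers; $R(\tau)$ stands for whichever task-specific reward is used.

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid Q)}\left[ R(\tau) \right], \quad \tau = \big((r_1, a_1, o_1), \ldots, (r_T, a_T, o_T)\big)$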

2. Reinforcement Learning and Optimization Paradigms

The most prominent class of algorithms for multi-turn TIR is reinforcement learning (RL) with group-based advantage estimation, generally extending the Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO) paradigm to account for tool-use actions and trajectory-level or turn-level rewards (Wei et al., 29 Jul 2025, Zhang et al., 1 Feb 2026, Ding et al., 18 Nov 2025, Wang et al., 18 May 2026).

AutoTIR introduces a hybrid reward structure combining action rewards (reflecting correct tool selection and penalizing misuse) and output rewards (reflecting final answer correctness and output structure) (Wei et al., 29 Jul 2025). The overall reward for each trajectory is

$R(\tau) = R_{\text{act}}(\tau) + R_{\text{out}}(\tau)$

and policy optimization proceeds over batches of $G$ parallel rollouts, using normalized group advantages and a clipped PPO-style surrogate objective.
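The two ingredients named here, group-normalized advantages and a clipped surrogate, follow the generic GRPO/PPO recipe and can be sketched as below. This is a scalar, per-sample illustration under assumed defaults (`eps`, `clip_eps`), not AutoTIR's training code.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against the mean and std of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts for the same question: two correct (reward 1), two incorrect (reward 0).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason sparse trajectory-level rewards can stall learning.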

Subsequent works recognize that trajectory-level rewards may be too sparse for long-horizon multi-turn reasoning. Group Turn Policy Optimization (GTPO) implements turn-level reward assignment, assigning explicit feedback for individual reasoning-tool turns and propagating advantage estimates over discounted future return, as well as reward shaping through code similarity to correct solutions (Ding et al., 18 Nov 2025). Dual-level or hierarchical advantage schemes, as in MatchTIR (Qu et al., 15 Jan 2026) and Implicit Hierarchical GRPO (IH-GRPO) (Wang et al., 18 May 2026), integrate local (turn-level or decision-point) and global (trajectory-level) feedback to drive both precise stepwise behavior and global task success.
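A rough sketch of how turn-level and trajectory-level signals might be combined is given below. It sums a local advantage (normalized discounted future return per turn) with a global advantage (normalized trajectory outcome). The discount `gamma`, the pooling of all turns for normalization, and the function names are assumptions of this sketch; the cited methods differ in their details.

```python
from statistics import mean, pstdev
from typing import List


def discounted_returns(turn_rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Return-to-go for each turn: G_t = r_t + gamma * G_{t+1}."""
    returns, running = [], 0.0
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def normalize(values: List[float], eps: float = 1e-6) -> List[float]:
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def dual_level_advantages(group_turn_rewards: List[List[float]],
                          group_outcomes: List[float],
                          gamma: float = 0.9) -> List[List[float]]:
    """Per-turn advantage = local (turn-level) term + global (trajectory-level) term."""
    global_adv = normalize(group_outcomes)
    returns = [discounted_returns(tr, gamma) for tr in group_turn_rewards]
    # Assumption: local returns are normalized over all turns pooled across the group.
    flat = [g for traj in returns for g in traj]
    mu, sigma = mean(flat), pstdev(flat)
    return [
        [(g - mu) / (sigma + 1e-6) + g_adv for g in traj]
        for traj, g_adv in zip(returns, global_adv)
    ]


example_returns = discounted_returns([1.0, 0.0, 1.0], gamma=0.5)
```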

Empirically, these fine-grained algorithms outperform purely trajectory-rewarded GRPO baselines by substantial absolute margins, especially on long-horizon benchmarks (e.g., MatchTIR's dual-level advantage improves Qwen3-8B F1 on complex tool tasks by 4.7 points over trajectory-only baselines) (Qu et al., 15 Jan 2026).

3. Data Curation, Behavior Calibration, and Training Stability

Training stability in multi-turn TIR is a central challenge due to distributional drift (feedback from tool outputs introduces out-of-distribution inputs), the compounding effect of errors through multiple steps, and the risk of mode collapse (e.g., always choosing one tool or underutilizing tools) (Wei et al., 29 Jul 2025, Guo et al., 10 Apr 2026, Xue et al., 2 Sep 2025).

Several innovations address these issues:

Interaction-Dense Cold Start: ASTER demonstrates that initializing with a small but high-interaction-density supervised fine-tuning (SFT) dataset (i.e., trajectories with many tool calls per problem) preserves policy entropy during subsequent RL, which avoids early collapse and sustains exploration. It reports state-of-the-art results (90% accuracy on AIME 2025 with a 4B model) (Zhang et al., 1 Feb 2026).
Branching Exploration and Expert-Guided Warmup: E3-TIR incorporates expert-prefix branching, where the RL agent branches around high-entropy ("uncertain") points from expert demonstrations, balancing exploration (via self-sampled rollouts) and exploitation (around expert-anchored prefixes/branches) (Guo et al., 10 Apr 2026). Theoretical analysis shows that branching at key nodes exponentially increases the probability of generating correct long-horizon trajectories compared to pure on-policy exploration.
Behavior Calibration and Pareto RL: The ET-Agent architecture systematically generates a flywheel of self-evolving trajectory variants—correct, incorrect, globally-refined, or locally-corrected—and uses Pareto sampling and curriculum RL to calibrate the agent against both correctness and behavior dispersion (measured as variance in tool-call counts), penalizing redundancy and incentivizing concise, efficient behavior patterns (Chen et al., 11 Jan 2026).
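The idea of branching at high-entropy points in an expert demonstration can be illustrated with a small helper that ranks positions by the Shannon entropy of the policy's next-step distribution. This is only a sketch of the selection step: the function names and the top-`k` rule are assumptions, and how E3-TIR actually defines and uses branch points is not specified in this article.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-step distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def branch_points(step_distributions: List[Sequence[float]], k: int = 2) -> List[int]:
    """Indices of the k most uncertain positions along an expert prefix."""
    entropies = [token_entropy(p) for p in step_distributions]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy distributions at four positions of a demonstration.
demo_distributions = [
    [1.0, 0.0],                # fully confident
    [0.5, 0.5],                # uncertain
    [0.9, 0.1],                # mostly confident
    [0.25, 0.25, 0.25, 0.25],  # most uncertain
]
selected = branch_points(demo_distributions, k=2)
```

Rollouts would then be sampled from the policy starting at the selected positions, with the expert tokens before each position kept as a fixed prefix.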

Critically, training dynamics benefit from rejecting or filtering degenerate trajectories, such as those containing "void turns" (neither code nor answer), as in SimpleTIR, which stabilizes gradient norms and prevents learning collapse without requiring heavy SFT (Xue et al., 2 Sep 2025). In RL with external tools, masking the environment feedback tokens during policy updates prevents distribution-shift-induced explosions in parameter updates (Lu et al., 24 Nov 2025, Wei et al., 29 Jul 2025).
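Both stabilizers can be sketched with plain Python. The markers used to detect code and answers (`<code>...</code>` and `\boxed{`) and the segment representation are assumptions made for this example; actual systems define these in their own prompt formats.

```python
import re
from typing import List, Sequence, Tuple


def is_void_turn(turn_text: str) -> bool:
    """A turn is 'void' if it contains neither a code block nor a final answer."""
    has_code = re.search(r"<code>.*?</code>", turn_text, re.DOTALL) is not None
    has_answer = "\\boxed{" in turn_text  # assumed answer marker
    return not (has_code or has_answer)


def filter_void_trajectories(trajectories: List[List[str]]) -> List[List[str]]:
    """Drop every trajectory that contains at least one void turn."""
    return [t for t in trajectories if not any(is_void_turn(turn) for turn in t)]


def loss_mask(segments: Sequence[Tuple[str, Sequence[str]]]) -> List[int]:
    """1 for model-generated tokens, 0 for tool-feedback tokens.

    `segments` is a list of (source, tokens) pairs with source in {"model", "tool"}.
    """
    mask: List[int] = []
    for source, tokens in segments:
        mask.extend([1 if source == "model" else 0] * len(tokens))
    return mask


good = ["<code>print(1 + 1)</code>", "The result is \\boxed{2}"]
bad = ["Let me think about this some more...", "\\boxed{3}"]
kept = filter_void_trajectories([good, bad])
mask = loss_mask([("model", ["a", "b"]), ("tool", ["x"]), ("model", ["c"])])
```

The mask would multiply the per-token loss so that tokens returned by the environment contribute no gradient.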

4. Autonomous Tool Selection, Integration, and Coordination

While earlier TIR methods relied on static, pre-specified tool use or imitation learning from fixed traces, leading approaches endow LLM agents with autonomous decision-making over whether and which tool to call in each context (Wei et al., 29 Jul 2025, Lu et al., 24 Nov 2025). In AutoTIR, for example, the agent generates at every reasoning step a <think> natural language block, optionally follows with a <search> or <code> tool invocation, and produces a boxed answer when appropriate (Wei et al., 29 Jul 2025). This flexible format allows the model to defer tool use, combine multiple forms of reasoning, or bypass tool invocation entirely.

Tool integration is operationalized via explicit control tags in the output stream; execution environments respond with standardized formats (e.g., <result>...</result>) that are appended to the reasoning context (Lu et al., 24 Nov 2025, Xu et al., 29 Oct 2025). The masking of tool outputs in the backward pass ensures that policy updates affect only the LLM, not the environment (Wei et al., 29 Jul 2025).
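A minimal sketch of this tag-based protocol is shown below: the model output is scanned for a tool tag, the matching tool is executed, and the result is appended in a `<result>` wrapper. The tag names follow the examples in this article; the `step` function and the stub tools are hypothetical.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Matches <search>...</search> or <code>...</code>; the backreference keeps tags paired.
TOOL_TAG = re.compile(r"<(search|code)>(.*?)</\1>", re.DOTALL)


def extract_tool_call(model_output: str) -> Optional[Tuple[str, str]]:
    """Return (tool_name, payload) for the first tool tag, or None if there is none."""
    match = TOOL_TAG.search(model_output)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def step(context: str, model_output: str,
         tools: Dict[str, Callable[[str], str]]) -> str:
    """Append the model output and, if it called a tool, the wrapped tool result."""
    context = context + model_output
    call = extract_tool_call(model_output)
    if call is None:
        return context
    name, payload = call
    return context + f"<result>{tools[name](payload)}</result>"


# Stub tools standing in for a real search backend and a sandboxed interpreter.
stub_tools = {
    "search": lambda query: f"stub hit for {query}",
    "code": lambda source: "3",
}
new_context = step("Q: 1+2? ", "<think>compute it</think><code>print(1+2)</code>", stub_tools)
```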

In specialized task settings:

MTIR-SQL enables multi-turn tool integration in Text-to-SQL generation, allowing stepwise query construction interleaved with real-time execution feedback, supporting dynamic correction of intermediate errors and progressive refinement (Xu et al., 29 Oct 2025).
VISTA-Gym (for visual reasoning) expands the multi-turn tool-integration paradigm to multimodal environments with 26 visual tools and a standardized API, highlighting the necessity of coordinated, programmatic tool selection in vision-language agent RL (Lu et al., 24 Nov 2025).
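The execution-feedback loop described for Text-to-SQL can be illustrated with Python's built-in `sqlite3` module. This is a toy sketch, not MTIR-SQL's pipeline: `propose` is a hypothetical stand-in for the model, and here it simply corrects a misspelled column after seeing the database error.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def execute_with_feedback(conn: sqlite3.Connection, sql: str):
    """Run a query; return (True, rows) on success or (False, error message)."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, f"{type(exc).__name__}: {exc}"


def refine_until_valid(conn: sqlite3.Connection,
                       propose: Callable[[Optional[str]], str],
                       max_turns: int = 3):
    """Alternate between proposing SQL and executing it, feeding errors back."""
    feedback: Optional[str] = None
    transcript: List[Tuple[str, bool, object]] = []
    for _ in range(max_turns):
        sql = propose(feedback)
        ok, result = execute_with_feedback(conn, sql)
        transcript.append((sql, ok, result))
        if ok:
            return result, transcript
        feedback = result  # the error text becomes the next turn's observation
    return None, transcript


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "ada"), (2, "bob")])


def toy_propose(feedback: Optional[str]) -> str:
    """Stub model: the first attempt has a typo, the second one fixes it."""
    if feedback is None:
        return "SELECT nme FROM users ORDER BY id"
    return "SELECT name FROM users ORDER BY id"


rows, transcript = refine_until_valid(conn, toy_propose)
```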

The generalization of these mechanisms across tasks and domains is evidenced by their performance on knowledge-intensive QA, mathematical reasoning, logical reasoning, and database coding tasks, as well as their robustness to out-of-distribution tool sets (Wei et al., 29 Jul 2025, Xu et al., 12 Apr 2026, Lu et al., 24 Nov 2025).

5. Credit Assignment, Reward Structures, and Evaluation

Effective learning of multi-turn TIR policies depends on precise, temporally informed credit assignment. State-of-the-art algorithms employ:

Hybrid Reward Structures: AutoTIR combines action-level (tool selection) and output-level (answer correctness and format) rewards, tuned to match the downstream utility of each class of action per task (Wei et al., 29 Jul 2025). The reward coefficients are specified explicitly.
Turn-Level and Step-Level Attribution: GTPO and MatchTIR assign fine-grained rewards to individual turns or tool calls, using bipartite matching between predicted and ground-truth tool-use traces, hard or soft assignment of partial credit per tool call, and explicit penalty for unmatched or erroneous actions (Ding et al., 18 Nov 2025, Qu et al., 15 Jan 2026).
Self-Supervised Reward Shaping: GTPO measures the code-level similarity of negative (incorrect) and positive (correct) trajectories, assigning partial reward to non-trivial but not fully correct behavior (Ding et al., 18 Nov 2025).
Dual-Level Advantage Estimation: MatchTIR's sum of trajectory-level (global) and turn-level (local, discounted future) group advantages addresses both global task outcome and local execution quality, yielding significant improvement in hard, long-horizon tool-use tasks (Qu et al., 15 Jan 2026).
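Matching-based credit assignment can be sketched as follows: predicted tool calls are paired one-to-one with ground-truth calls so as to maximize total similarity, matched calls receive their similarity as reward, and unmatched or zero-similarity calls receive a penalty. The similarity function, the penalty value, and the brute-force search are assumptions chosen to keep the example dependency-free; they are not the scoring rules of GTPO or MatchTIR.

```python
from itertools import permutations
from typing import Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, object]]  # (tool_name, arguments)


def call_similarity(pred: ToolCall, gold: ToolCall) -> float:
    """0 if tool names differ, otherwise the fraction of agreeing argument keys."""
    if pred[0] != gold[0]:
        return 0.0
    keys = set(pred[1]) | set(gold[1])
    if not keys:
        return 1.0
    return sum(1 for k in keys if pred[1].get(k) == gold[1].get(k)) / len(keys)


def best_matching(pred_calls: List[ToolCall], gold_calls: List[ToolCall]):
    """Brute-force maximum-weight one-to-one matching (fine for a handful of calls)."""
    n, m = len(pred_calls), len(gold_calls)
    best_score, best_pairs = -1.0, []
    if n <= m:
        candidates = (list(zip(range(n), perm)) for perm in permutations(range(m), n))
    else:
        candidates = (list(zip(perm, range(m))) for perm in permutations(range(n), m))
    for pairs in candidates:
        score = sum(call_similarity(pred_calls[i], gold_calls[j]) for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs


def match_rewards(pred_calls: List[ToolCall], gold_calls: List[ToolCall],
                  penalty: float = -0.5) -> List[float]:
    """Per-call reward: similarity to the matched gold call, else a penalty."""
    rewards = [penalty] * len(pred_calls)
    for i, j in best_matching(pred_calls, gold_calls):
        sim = call_similarity(pred_calls[i], gold_calls[j])
        if sim > 0:
            rewards[i] = sim
    return rewards


predicted = [("search", {"q": "paris"}), ("code", {"src": "1+1"}), ("search", {"q": "junk"})]
reference = [("code", {"src": "1+1"}), ("search", {"q": "paris"})]
per_call_rewards = match_rewards(predicted, reference)
```

For larger call sets, a polynomial-time assignment solver would replace the permutation search.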

Comprehensive evaluation draws on discriminative benchmarks including mathematical (AIME 2024/2025, MATH500, AMC23), multi-hop QA (HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle), and open-ended tasks (LogiQA, IFEval), with rigorous reporting of exact match, F1, tool selection accuracy (TS), and tool productivity (TP) (Wei et al., 29 Jul 2025). Robust evaluation tools such as TIDE-Bench further integrate process reliability, tool-use efficiency, and inference cost, exposing persistent deficits in tool grounding and multi-tool coordination (Li et al., 10 May 2026).
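Exact match and token-level F1 are commonly computed in the style of extractive QA evaluation scripts, as sketched below. The normalization steps (lowercasing, stripping punctuation and English articles) are a widely used convention and an assumption here; individual benchmarks may normalize differently. TS and TP are not shown because their exact definitions are not given in this article.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em_example = exact_match("The Eiffel Tower.", "eiffel tower")
f1_example = token_f1("Paris France", "Paris")
```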

6. Adaptation, Efficiency, and Inference-time Optimization

Beyond supervised and RL-based training, recent methods address efficiency and robustness at inference time:

PruneTIR implements training-free, post-hoc trajectory pruning and tool call resampling based on real-time detection of erroneous tool outputs and recurrent failure. Success-triggered pruning, stuck-triggered resampling, and retry-triggered tool suspension collectively increase Pass@1 by over 10 points and halve the number of wasteful tool invocations, without any model updates (Zhang et al., 11 May 2026).
Outcome Efficiency and Overthinking Mitigation: Empirical analysis demonstrates that TIR-enabled models achieve higher outcome efficiency (more correct solutions per token or turn), reach correct conclusions in fewer steps, and exhibit reduced “overthinking” and redundant tool use compared to non-TIR or static tool-invocation baselines (Zhao et al., 21 Aug 2025).
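A loose sketch of two of the inference-time rules named above is given below: dropping failed attempts once a tool call succeeds, and suspending a tool after repeated consecutive failures. The data structure, the failure threshold, and the exact pruning rule are assumptions for illustration; PruneTIR's actual criteria are not specified in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolTurn:
    call: str
    output: str
    ok: bool  # whether the tool call executed without error


def prune_on_success(history: List[ToolTurn]) -> List[ToolTurn]:
    """If the latest call succeeded, drop the failed attempts directly before it."""
    if not history or not history[-1].ok:
        return history
    kept = list(history[:-1])
    while kept and not kept[-1].ok:
        kept.pop()
    return kept + [history[-1]]


def should_suspend_tool(history: List[ToolTurn], max_consecutive_failures: int = 3) -> bool:
    """True once the most recent calls form a long enough streak of failures."""
    streak = 0
    for turn in reversed(history):
        if turn.ok:
            break
        streak += 1
    return streak >= max_consecutive_failures


history = [
    ToolTurn("print(1)", "1", True),
    ToolTurn("prnt(2)", "NameError", False),
    ToolTurn("print(2", "SyntaxError", False),
    ToolTurn("print(2)", "2", True),
]
pruned = prune_on_success(history)
```

Pruning shortens the context the model conditions on in later turns, which is where the savings in wasted tokens and repeated calls would come from.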

7. Extensions, Limitations, and Future Directions

Current multi-turn TIR frameworks are being extended along several dimensions:

Internalized Tool Knowledge: TInR-U demonstrates that LLMs can internalize tool schemas and usage patterns, supporting tokenized invocation without external documentation, which yields higher accuracy and efficiency especially as tool library size increases (Xu et al., 12 Apr 2026).
Adaptive Tool Creation: UCT enables the agent to create and update new tools on the fly by distilling reusable assets from previously successful (or failed) reasoning traces, with continual consolidation and pruning for long-term reusability (Shen et al., 2 Feb 2026).
Hierarchical and Delayed Execution: IH-GRPO formalizes decoupled tool invocation versus execution decisions, adding an implicit hierarchical control layer (modeled via a surrogate loss) that improves reasoning coherence and buffered computation (Wang et al., 18 May 2026).

Persisting limitations include limited scaling to very large models and tool suites, reliance on high-quality expert or repairer trajectories for warm start, and the need for domain adaptation to handle out-of-distribution tools, modalities, and open-ended problem spaces. Open research challenges involve integrating dense, stepwise evaluators, enabling curriculum learning for multi-tool workflows, and systematizing reward scaling for efficiency/accuracy trade-offs (Lu et al., 24 Nov 2025, Li et al., 10 May 2026).

Ultimately, multi-turn TIR represents a convergent thread uniting principles from reinforcement learning, planning, preference modeling, and model-based tool-use, yielding LLM agents capable of robust, adaptive, long-horizon dynamic reasoning with tools across mathematical, scientific, and knowledge-intensive task domains.