Multi-Turn Tool-Calling LLMs
- Multi-turn tool-calling LLMs are agent architectures designed for iterative API invocation with explicit planning and state tracking to solve compositional, context-dependent tasks.
- They employ supervised, preference, and reinforcement learning strategies to optimize tool selection and error recovery while managing dynamic memory efficiently.
- Empirical benchmarks reveal these models achieve higher accuracy and deeper search strategies compared to single-turn approaches, especially in complex, multi-step scenarios.
Multi-turn tool-calling LLMs are agent architectures that interact with external functions or APIs over multi-turn dialogues, leveraging both sequential tool invocations and their intermediate results to solve compositional, context-dependent tasks. This paradigm arises from the growing need to move beyond single-step retrieval augmented generation (RAG) and single-turn function calling, allowing LLMs to handle scenarios such as iterative search, data transformation, symbolic computation, and real-world API orchestration. Architectures employ explicit reasoning/planning steps, action selection for tool invocation, structured state tracking, reward-driven learning (via RL), and memory management to support long-horizon interaction and dynamic information flows.
1. Formal Problem Definition and Agent Frameworks
The multi-turn tool-calling setting formalizes the agent’s state at turn $t$ as $s_t = (q, h_t, o_{1:t-1})$, with $q$ the user query, $h_t$ the hidden embedding capturing history, and $o_{1:t-1}$ the outputs of prior tool calls (Kalyan et al., 28 Oct 2025). The action space comprises tool invocations, each with its arguments, and the final answer. At every turn, the model policy $\pi_\theta$ can (a) select a tool and generate its arguments, or (b) terminate with the answer. State transitions append the tool feedback to the context and update the internal representation.
A canonical multi-turn framework can be described as follows (a minimal loop sketch appears after this list):
- Observation: full history of queries, tool calls (with arguments), and tool outputs.
- Planning: select which tool to call next or whether to answer.
- Tool Execution: call external API or function, receive and record output.
- Termination: output the answer, possibly with supporting citations or structured responses.
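The following Python sketch makes this cycle concrete. It is a minimal, framework-agnostic illustration, not any cited system's implementation; `llm_step`, `TOOLS`, and the message format are hypothetical placeholders.

```python
import json

# Hypothetical tool registry: name -> callable taking keyword arguments.
TOOLS = {
    "search": lambda query: f"results for {query!r}",
    "calculator": lambda expression: str(eval(expression)),  # toy example only
}

def llm_step(messages):
    """Placeholder for one policy call: returns either
    {"tool": name, "args": {...}} or {"answer": text}."""
    raise NotImplementedError

def run_episode(user_query, max_turns=10):
    messages = [{"role": "user", "content": user_query}]   # observation so far
    for _ in range(max_turns):
        decision = llm_step(messages)                       # planning: tool or answer
        if "answer" in decision:                            # termination
            return decision["answer"], messages
        name, args = decision["tool"], decision["args"]     # action selection
        output = TOOLS[name](**args)                        # tool execution
        # Append the call and its result so later turns can condition on them.
        messages.append({"role": "assistant",
                         "content": json.dumps({"tool": name, "args": args})})
        messages.append({"role": "tool", "name": name, "content": output})
    return None, messages  # turn budget exhausted without a final answer
```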
In advanced settings, e.g., "From Tool Calling to Symbolic Thinking: LLMs in a Persistent Lisp Metaprogramming Loop" (Torre, 8 Jun 2025), agent-driven tool creation, versioning, and persistent stateful memory (via a REPL) enable the model to invent new symbolic procedures and refine them across turns.
2. Training Algorithms: Supervised, Preference, and Reinforcement Learning
Training multi-turn tool-calling LLMs involves (i) supervised fine-tuning (SFT) on high-fidelity multi-turn trajectories, (ii) preference optimization, and (iii) outcome-driven RL.
Supervised approaches synthesize or collect multi-turn trajectories, where each step involves tool selection, argument filling, tool output parsing, and propagation into subsequent reasoning (Yin et al., 10 Mar 2025, Chen et al., 16 Oct 2024). The loss is typically next-token cross-entropy, possibly masked for assistant turns only.
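A common way to restrict the loss to assistant turns is label masking with the standard `ignore_index`. The PyTorch sketch below is illustrative; how `assistant_mask` is derived depends on the chat template and is assumed here.

```python
import torch
import torch.nn.functional as F

def masked_sft_loss(logits, input_ids, assistant_mask):
    """Next-token cross-entropy computed only on assistant tokens.

    logits:         (batch, seq_len, vocab) model outputs
    input_ids:      (batch, seq_len) token ids of the full multi-turn trajectory
    assistant_mask: (batch, seq_len) 1 where the token was produced by the assistant
                    (tool-call arguments, reasoning, final answer), 0 for user/tool turns.
    """
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    shift_mask = assistant_mask[:, 1:]
    # Ignore non-assistant positions via the standard ignore_index.
    shift_labels[shift_mask == 0] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```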
Preference learning (e.g., Magnet (Yin et al., 10 Mar 2025)) contrasts positive trajectories (ground-truth tool usage hints) against negative/adversarial trajectories (error patterns injected as context), optimizing an mDPO-like objective that drives the model to prefer the correct sequence over misleading alternatives.
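The exact mDPO objective is not reproduced here; the sketch below shows a generic trajectory-level DPO-style loss that contrasts a preferred (correct tool-use) trajectory against a rejected (error-injected) one, using summed assistant-token log-probabilities under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_logp_pos, policy_logp_neg,
                   ref_logp_pos, ref_logp_neg, beta=0.1):
    """Trajectory-level preference loss (DPO-like; Magnet's mDPO objective
    may weight turns differently).

    Each argument is the summed log-probability of the assistant tokens of a
    whole multi-turn trajectory under the policy or the frozen reference model.
    """
    # Margin between the policy's and the reference's preference for the
    # positive (correct tool-use) trajectory over the negative one.
    logits = beta * ((policy_logp_pos - policy_logp_neg)
                     - (ref_logp_pos - ref_logp_neg))
    return -F.logsigmoid(logits).mean()
```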
RL algorithms: Recent advances employ actor-critic style methods, especially Group Relative Policy Optimization (GRPO) (Kalyan et al., 28 Oct 2025, Singh et al., 28 Apr 2025). Here, batches of multi-turn trajectories are grouped; the agent’s parameters are updated to increase the likelihood of higher-reward rollouts relative to their groupmates, with KL regularization for stability. The reward function is outcome-centric, incorporating solution correctness, tool citation match, efficiency (fewer calls), and formatting penalties. This enables the agent to discover dynamic strategies, such as iterative refinement, multi-hop search, or error recovery in symbolic tool interaction.
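As a rough illustration of the group-relative idea, the sketch below normalizes outcome rewards within a group of rollouts for the same prompt and weights each trajectory's log-likelihood by that advantage; the PPO-style ratio clipping and KL penalty used in full GRPO are omitted.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (group_size,) outcome rewards for rollouts of one prompt
    (e.g., correctness + tool-citation match - call-count/format penalties)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(token_logps, advantages, response_mask):
    """token_logps:   (group_size, seq_len) log-probs of generated tokens
    advantages:    (group_size,) group-normalized advantages
    response_mask: (group_size, seq_len) 1 for assistant/tool-call tokens."""
    per_traj = (token_logps * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)
    # Maximize advantage-weighted log-likelihood (omits ratio clipping and the KL term).
    return -(advantages * per_traj).mean()
```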
3. Data Synthesis for Multi-Turn Function Calling
Curating high-quality multi-turn training data is essential (Xu et al., 28 Oct 2025, Chen et al., 16 Oct 2024, Yin et al., 10 Mar 2025). Synthesis frameworks like FunReason-MT (Xu et al., 28 Oct 2025), Magnet (Yin et al., 10 Mar 2025), and BUTTON (Chen et al., 16 Oct 2024) construct data with compositional trajectories, varied logical dependencies, and realistic tool usage.
- Graph-based sampling: Build API relation graphs modeling inter-tool dependencies and simulate environment traversals constrained by prerequisite satisfaction (a toy sketch follows this list).
- Advanced query synthesis: Abstract chains of tool calls into composite "advanced tools," then generate reverse-engineered queries that force the use of such chains.
- Multi-agent simulation: BUTTON creates complex scenarios and decomposes tasks bottom-up, synthesizes atomic and compositional subtasks, and then simulates multi-agent dialogue with rigorous turn-by-turn state and function definitions.
- Guided iterative chain-of-thought: In FunReason-MT, agents engage in self-critique correction loops, validating function call traces against the API graph and revising their trajectories to enforce full logical coherence.
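The toy sketch below illustrates the graph-based sampling step: walking an API relation graph under prerequisite constraints to obtain a compositional chain of tool calls that can then be reverse-engineered into a query. The tool names and graph are invented for illustration.

```python
import random

# Toy API relation graph: an edge A -> B means B can consume A's output
# (tool names and dependencies are illustrative, not from any cited dataset).
API_GRAPH = {
    "search_flights": ["book_flight", "compare_prices"],
    "compare_prices": ["book_flight"],
    "book_flight": ["send_confirmation"],
    "send_confirmation": [],
}

def sample_tool_chain(graph, max_len=4, seed=None):
    """Sample a compositional chain of tool calls by walking the dependency graph.

    Each step only moves to a tool whose prerequisite (the previous tool's
    output) is satisfied, mimicking prerequisite-constrained traversal.
    """
    rng = random.Random(seed)
    current = rng.choice([t for t, succ in graph.items() if succ])  # start at a tool with successors
    chain = [current]
    while len(chain) < max_len and graph[current]:
        current = rng.choice(graph[current])
        chain.append(current)
    return chain

# A sampled chain can then be "reverse-engineered" into a user query that
# requires exactly this sequence of calls, plus turn-by-turn tool outputs.
print(sample_tool_chain(API_GRAPH, seed=0))
```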
These frameworks demonstrate that targeted graph exploration and composition-centric tuning yield state-of-the-art multi-turn tool use accuracy (e.g., +40.75 pp multi-turn gain in FunReason-MT RL over base models (Xu et al., 28 Oct 2025)).
4. Memory Management and Scalability
Tool-calling agents require dynamic memory management—especially in environments with thousands of APIs and finite context windows (Lumer et al., 29 Jul 2025, Kate et al., 30 Apr 2025). The MemTool framework (Lumer et al., 29 Jul 2025) addresses efficient short-term memory in large-scale, multi-turn sessions:
- Instantaneous Removal Ratio: the fraction of tools no longer needed at a given turn that the agent actually removes from context; measures tool-pruning efficiency.
- Sliding Window Metrics: Capture recent memory churn and average residual tool count.
- Design modes: Autonomous Agent Mode delegates tool-add/remove decisions to the LLM; Workflow Mode deterministically prunes/searches up front; Hybrid Mode combines upfront pruning with dynamic recovery.
Top-performing LLMs in Autonomous/Hybrid modes maintain 90–94% removal ratio and average 5–10 tools in context, supporting 0.80–0.90 task completion accuracy over 100-turn sessions (Lumer et al., 29 Jul 2025). Efficient memory management is crucial for scaling multi-turn tool-calling without exhausting context or incurring excessive inference latency.
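The sketch below illustrates how such metrics might be tracked; it follows the descriptions above rather than MemTool's exact definitions, and the helper names are assumptions.

```python
from collections import deque

def instantaneous_removal_ratio(stale_tools, removed_tools):
    """Fraction of no-longer-needed tools actually removed from context this turn
    (illustrative; MemTool's exact metric definition may differ)."""
    if not stale_tools:
        return 1.0
    return len(set(removed_tools) & set(stale_tools)) / len(set(stale_tools))

class SlidingWindowMemoryStats:
    """Track removal ratios and residual tool counts over the last k turns."""
    def __init__(self, k=10):
        self.ratios = deque(maxlen=k)
        self.tool_counts = deque(maxlen=k)

    def update(self, stale_tools, removed_tools, tools_in_context):
        self.ratios.append(instantaneous_removal_ratio(stale_tools, removed_tools))
        self.tool_counts.append(len(tools_in_context))

    def summary(self):
        return {
            "avg_removal_ratio": sum(self.ratios) / max(len(self.ratios), 1),
            "avg_tools_in_context": sum(self.tool_counts) / max(len(self.tool_counts), 1),
        }
```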
5. Empirical Results, Benchmarks, and Performance Degradation
Agent performance on multi-turn tool-calling is quantitatively evaluated on benchmarks such as BFCL v3/v4, ToolQuery, MINT, ScaleMCP, and in synthetic test environments (Kalyan et al., 28 Oct 2025, Xu et al., 28 Oct 2025, Kate et al., 30 Apr 2025, Wang et al., 2023). Key findings include:
- RL-trained agents (Qwen3-14B+GRPO) achieve 85% accuracy on legal document search (vs. 78% for Gemini 2.5 Pro, 33% for naive RAG), with monotonic improvement as turn count increases past 6, revealing deep search strategies (Kalyan et al., 28 Oct 2025).
- Magnified multi-turn gains: Magnet-14B-mDPO yields 68.01% BFCL-v3 success (vs. teacher Gemini-1.5-Pro-002’s 62.19%) and 73.3% ToolQuery success (Yin et al., 10 Mar 2025).
- Disambiguation-centric finetuning (DiaFORGE) improves dynamic tool-call accuracy by 27–49 pp compared to leading closed-source models (Hathidara et al., 4 Jul 2025).
- Long context is still a bottleneck: as catalog size and tool-response length grow (to 128K and 80K tokens, respectively), accuracy degrades by up to 85–91% (Kate et al., 30 Apr 2025). Increasing multi-turn dialogue length induces up to a 68–95% drop in AST-match accuracy for all but the largest models.
- RL, supervised instruction, and feedback: MINT (Wang et al., 2023) demonstrates 1–8% improvement per additional tool call and 2–17% absolute gain from natural-language feedback, but observes that SIFT/RLHF can decrease multi-turn capabilities unless multi-turn examples are explicitly included.
6. System Architectures and Symbolic Tool Integration
Tool invocation can be tightly coupled via explicit tagging, generative tokenization, or symbolic code emission. ToolGen (Wang et al., 4 Oct 2024) virtualizes each external tool as a vocabulary token, supporting unified retrieval and calling through standard next-token prediction and achieving 92.7 NDCG@5 retrieval accuracy over a 47,000-tool space with no external index.
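The sketch below illustrates the tool-as-token idea with the Hugging Face API (this is not ToolGen's released code): each tool is registered as one special vocabulary token, so after fine-tuning, retrieval and invocation reduce to ordinary next-token prediction over the expanded vocabulary.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative tool tokens; the model choice and names are placeholders.
tool_names = ["<tool:get_weather>", "<tool:search_flights>", "<tool:translate_text>"]

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # any causal LM works for the sketch
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_tokens(tool_names, special_tokens=True)       # one vocabulary entry per tool
model.resize_token_embeddings(len(tokenizer))                # new rows are then trained on tool-use data

# After fine-tuning, generating the next token directly "retrieves" a tool:
prompt = "User: what's the weather in Paris?\nAssistant calls:"
inputs = tokenizer(prompt, return_tensors="pt")
next_token_id = model(**inputs).logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))                     # ideally one of the <tool:...> tokens
```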
In symbolic and persistent settings (Torre, 8 Jun 2025), LLMs generate code snippets (e.g., Lisp expressions); a middleware intercepts, evaluates, and injects results into an interactive loop, with versioning and garbage collection for long-lived tool registries. This enables agentic introspection and dynamic tool evolution within multi-turn workflows.
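The cited work operates on Lisp; the Python sketch below shows the same intercept-evaluate-inject middleware pattern, with a persistent namespace standing in for the REPL. The `<eval>` tag convention and function names are illustrative assumptions.

```python
import io
import contextlib

class PersistentREPL:
    def __init__(self):
        self.namespace = {}          # survives across turns: definitions persist

    def evaluate(self, code: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)   # model-generated code runs in the shared namespace
            return buf.getvalue() or "ok"
        except Exception as e:               # feed errors back so the model can self-correct
            return f"error: {e}"

def agent_turn(llm_generate, repl, messages):
    reply = llm_generate(messages)
    if reply.startswith("<eval>"):           # assumed convention: tagged reply means "execute this"
        code = reply.removeprefix("<eval>").removesuffix("</eval>").strip()
        result = repl.evaluate(code)
        messages.append({"role": "tool", "content": result})   # inject result for the next turn
    else:
        messages.append({"role": "assistant", "content": reply})
    return messages
```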
Memory augmentation, pointer calls, chain-of-thought priors, and persistent REPLs represent current strategies for increasing agent robustness to long-horizon tasks, deep compositionality, and evolving toolsets.
7. Outstanding Challenges and Future Directions
Immediate research targets include:
- Scaling multi-turn RL training to longer horizons and larger tool inventories (while alleviating reward sparsity) (Kalyan et al., 28 Oct 2025, Lumer et al., 29 Jul 2025).
- Improving LLM sensitivity to early context and reducing "lost in the middle" degradation (Kate et al., 30 Apr 2025).
- Automating discovery of new tools and dynamic tool-set adaptation (Wang et al., 4 Oct 2024).
- Enabling robust uncertainty calibration (“I don’t know” abstention) under distributional shift.
- Extending frameworks to multimodal/cross-lingual tools, persistent memory, and self-supervised agentic learning.
Multi-turn tool-calling LLMs now match or surpass closed-source models on agentic benchmarks, but further progress will depend on advances in scalable RL, efficient data synthesis, robust long-context handling, and memory-efficient architectures. These approaches are rapidly generalizing to database querying, code-to-text agents, multi-hop knowledge graph traversal, real-time enterprise orchestration, and autonomous symbolic programming.