
FunReason-MT: Multi-Turn Function Calling

Updated 1 March 2026
  • FunReason-MT is a data synthesis and training framework designed to overcome the complexity of multi-turn function calling by generating realistic, structured tool-use trajectories.
  • The framework leverages an API Relation Graph and an advanced tool-query synthesis method to enforce dependencies and modular tool use in various tasks.
  • It incorporates a critic-driven, iterative chain-of-thought loop and per-task loss balancing, yielding significant performance gains in multi-turn scenarios.

FunReason-MT is a data synthesis and training framework for LLMs, specifically designed to overcome the complexity barrier in multi-turn function calling (FC) and to enable robust multi-task reasoning and tool use. It generalizes the FunReason paradigm by constructing realistic, deeply structured multi-turn tool-use trajectories and by providing a per-task, per-segment loss balancing scheme that can be scaled to diverse tasks, including API invocation, question answering, and summarization (Hao et al., 26 May 2025, Xu et al., 28 Oct 2025).

1. Background: The Challenge of Multi-Turn Function Calling

In contemporary LLM-based agents, function calling—the ability to interface with and invoke external APIs or tools—is critical for solving complex real-world problems. Earlier frameworks focused on single-turn FC, where the model produces a function call and (optionally) a chain of thought (CoT) in response to a one-shot query. However, most real-world tasks require multi-turn tool use with persistent state, dependencies among tool invocations, and context-dependent reasoning.

In this multi-turn scenario, each trajectory is represented as

T = \{ (s_0,\; q_1, a_1, s_1, \ldots, q_n, a_n, s_n) \},

where s_i is the environment state after i turns, q_i is a (user or agent) query, and a_i is the corresponding function call. Existing data synthesis strategies—random environment sampling and multi-agent role-play (MAS)—rarely generate the complex, long-tail, logically dependent test cases needed for robust FC. Consequently, models exhibit:

  • Lack of targeted model training (poor coverage of difficult or key tools);
  • Poor tool architecture isolation (failure to induce and compose modular subtools);
  • Fragile multi-turn logical dependency handling (Xu et al., 28 Oct 2025).
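The trajectory definition above can be made concrete with a small data structure. This is a minimal Python sketch, not the paper's implementation; the field names and the toy "login" tool are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One turn of a multi-turn trajectory: query q_i, function call a_i,
    and the environment state s_i observed after the call."""
    query: str
    call: dict    # e.g. {"name": "login", "args": {}}
    state: dict   # environment state s_i after executing the call

@dataclass
class Trajectory:
    """T = (s_0, q_1, a_1, s_1, ..., q_n, a_n, s_n)."""
    initial_state: dict                        # s_0
    turns: list[Turn] = field(default_factory=list)

    def final_state(self) -> dict:
        """Return s_n, or s_0 if no turn has happened yet."""
        return self.turns[-1].state if self.turns else self.initial_state

traj = Trajectory(initial_state={"logged_in": False})
traj.turns.append(Turn("log me in", {"name": "login", "args": {}},
                       {"logged_in": True}))
```

The persistent-state aspect is what distinguishes this from single-turn FC: each Turn's `state` feeds into how the next query must be interpreted.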

2. FunReason-MT Framework Design

FunReason-MT addresses these deficiencies via a top-down, three-stage data generation pipeline and an associated multi-task training objective.

2.1 Environment–API Graph Interactions

FunReason-MT constructs an explicit API Relation Graph,

\mathcal{G} = (\mathcal{T},\;\mathcal{R},\;\mathcal{P}),

where \mathcal{T} is the set of basic tools, \mathcal{R} encodes prerequisite dependencies between tools, and \mathcal{P} specifies parameter schemas.

During synthesis, the system maintains the set of called tools \mathcal{T}_{\mathrm{called}}, and only exposes as legal choices those tools for which all prerequisites are satisfied. Sampling is directed—if the aim is to guarantee that tool T_a is eventually used, the mechanism prioritizes actions that minimize the graph distance to T_a.
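The prerequisite check and distance-directed sampling can be sketched as follows. This is a minimal illustration under assumptions: the tool names and the dependency chain are invented, and the paper does not specify this exact traversal:

```python
from collections import deque

# Hypothetical prerequisite relation R: tool -> set of tools that must
# already have been called. Names are illustrative, not from the paper.
PREREQS = {
    "login": set(),
    "search_flights": {"login"},
    "book_flight": {"search_flights"},
    "pay": {"book_flight"},
}

def legal_tools(called: set[str]) -> set[str]:
    """Tools whose prerequisites are all contained in T_called."""
    return {t for t, pre in PREREQS.items() if pre <= called and t not in called}

def distance_to(target: str, tool: str) -> float:
    """BFS graph distance from `tool` to `target`, following edges from a
    tool to the tools that list it as a prerequisite."""
    children = {t: {u for u, pre in PREREQS.items() if t in pre} for t in PREREQS}
    frontier, seen = deque([(tool, 0)]), {tool}
    while frontier:
        node, d = frontier.popleft()
        if node == target:
            return d
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return float("inf")

def next_tool(called: set[str], target: str) -> str:
    """Directed sampling: among legal tools, pick the one closest to T_a."""
    return min(legal_tools(called), key=lambda t: distance_to(target, t))

called, order = set(), []
while "pay" not in called:
    t = next_tool(called, "pay")
    called.add(t)
    order.append(t)
```

Because only prerequisite-satisfied tools are ever legal, every sampled trajectory respects \mathcal{R} by construction, and the distance heuristic steers the walk toward the target tool rather than wandering the graph at random.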

2.2 Advanced Tool–Query Synthesis

From a primitive multi-step tool-use trace, FunReason-MT synthesizes an "advanced tool" abstraction that encapsulates the composite operation T_{\mathrm{adv}} = A_T(\mathrm{Turn}_i), and generates a difficult query Q_{\mathrm{hard}} = A_Q(T_{\mathrm{adv}}, \epsilon) that forces invocation of the entire composed operation. This guarantees that the dataset includes scenarios with challenging logical jumps and cross-tool dependencies in the target domain.
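The two operators A_T and A_Q can be sketched like this. The function names follow the text; the composition and query-templating strategies are illustrative assumptions, not the paper's prompts:

```python
def compose_advanced_tool(trace: list[dict]) -> dict:
    """A_T: wrap a multi-step trace into one composite tool T_adv whose
    execution would replay every primitive call in order."""
    return {
        "name": "adv_" + "_".join(step["name"] for step in trace),
        "steps": trace,
    }

def synthesize_hard_query(adv_tool: dict, noise: str = "") -> str:
    """A_Q: phrase a query that is only satisfiable by the whole composite
    operation, optionally perturbed by a distractor term (epsilon)."""
    goal = adv_tool["steps"][-1]["name"]
    return f"Achieve '{goal}' end to end in one request. {noise}".strip()

# Toy trace over two hypothetical primitive tools.
trace = [{"name": "search_flights"}, {"name": "book_flight"}]
t_adv = compose_advanced_tool(trace)
q_hard = synthesize_hard_query(t_adv)
```

The point of the abstraction is that answering q_hard requires the model to plan the full primitive sequence, not just the final call, which is exactly the long-range dependency the dataset is meant to cover.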

2.3 Guided Iterative Chain-of-Thought Generation

Given these challenging queries, the framework incorporates a critic-driven, self-correcting generation loop. A Reasoning Agent proposes an initial CoT and function call; a validation step compares the output against the ground truth. If a failure occurs, a Critiquing Agent diagnoses the error, and a new iteration refines the reasoning chain and output. Only successful, fully validated traces are included in the final multi-turn dataset.
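The loop structure can be sketched as below. The agent calls are stubbed with toy functions; a real system would back `propose` and `critique` with LLM calls, and the `max_iters` cap is an assumed safeguard:

```python
def guided_cot_generation(query, ground_truth, propose, critique, max_iters=5):
    """Iterate propose -> validate -> critique until the function call
    matches the ground truth; return the validated (cot, call) pair, or
    None if no iteration succeeds (such traces are discarded)."""
    feedback = None
    for _ in range(max_iters):
        cot, call = propose(query, feedback)   # Reasoning Agent
        if call == ground_truth:               # validation step
            return cot, call                   # only validated traces are kept
        feedback = critique(query, cot, call)  # Critiquing Agent diagnoses
    return None

# Toy stand-ins: the proposer corrects its call once it receives feedback.
def propose(query, feedback):
    return ("reasoning...", {"name": "pay"} if feedback else {"name": "login"})

def critique(query, cot, call):
    return f"{call['name']} is not the required final action"

trace = guided_cot_generation("settle the booking", {"name": "pay"},
                              propose, critique)
```

The key property, matching the text, is that a failed proposal never reaches the dataset: either the critique eventually steers the reasoning chain to a validated output, or the example is dropped.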

3. Multi-Task Extension: Loss Formulation and Training Loop

FunReason introduces a Self-Refinement Multiscale Loss (SRML) designed to balance the cross-entropy contributions of the reasoning (CoT) and function-call (FC) segments:

\mathcal{L}_{\text{MSL}} = a\,\mathcal{L}_{\text{think}} + b\,\mathcal{L}_{\text{result}}, \quad a + b = 1.

This formulation can be generalized to a multi-task regime, with per-task, per-segment coefficients,

\mathcal{L}_{\text{MT}} = \sum_{i=1}^{T} w_i \left[ a_i\,\mathcal{L}^{(i)}_{\text{think}} + (1 - a_i)\,\mathcal{L}^{(i)}_{\text{result}} \right],

where tasks may encompass different output formats (e.g., API call, free text, summary), with task weights w_i and per-task balancing coefficients a_i (Hao et al., 26 May 2025).
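The multi-task loss is a straightforward weighted combination. A minimal sketch, with plain floats standing in for the per-segment cross-entropies and entirely hypothetical weight values:

```python
def multi_task_loss(tasks):
    """Compute L_MT for a list of tasks, each given as
    (w_i, a_i, L_think_i, L_result_i), with 0 <= a_i <= 1 balancing the
    CoT segment against the function-call segment."""
    return sum(w * (a * l_think + (1 - a) * l_result)
               for w, a, l_think, l_result in tasks)

# Two hypothetical tasks: a CoT-heavy API-invocation task and a
# result-heavy summarization task.
tasks = [
    (0.7, 0.6, 2.0, 1.0),   # (w_i, a_i, L_think, L_result)
    (0.3, 0.2, 1.5, 0.5),
]
loss = multi_task_loss(tasks)
```

In a real training loop the L_think/L_result terms would be segment-masked cross-entropies over the CoT tokens and the call tokens respectively; only the combination rule is shown here.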

Each data domain is processed using a specialized Function Call Data Refinement pipeline (FCDR_i), which enforces output correctness and format as per FunReason's five-stage validation scheme (function-call classification, tool match, chain-of-thought validation, parameter inspection, formatting).

The self-refinement loop further enhances quality: after initial SFT with \mathcal{L}_{\text{MT}}, new examples are generated on held-out data, validated or corrected with FCDR_i, and used to continue fine-tuning.
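The five validation stages can be sketched as a chain of predicates that an example must pass in order. The concrete checks below are illustrative assumptions; the paper's pipeline applies domain-specific validators at each stage:

```python
def fcdr_validate(example: dict) -> bool:
    """Accept an example only if all five FCDR-style stages pass."""
    stages = [
        lambda e: e["is_function_call"],                       # 1. FC classification
        lambda e: e["call"]["name"] in e["available_tools"],   # 2. tool match
        lambda e: len(e["cot"].strip()) > 0,                   # 3. CoT validation
        lambda e: set(e["call"]["args"]) <= set(e["schema"]),  # 4. parameter inspection
        lambda e: isinstance(e["call"]["args"], dict),         # 5. formatting
    ]
    return all(stage(example) for stage in stages)

# Hypothetical example that passes every stage.
example = {
    "is_function_call": True,
    "call": {"name": "pay", "args": {"amount": 42}},
    "available_tools": {"pay", "login"},
    "cot": "user wants to settle the booking",
    "schema": {"amount", "currency"},
}
```

Because `all` short-circuits, an example is rejected at the first failing stage, which is the behavior one wants when the later stages are expensive (e.g., CoT validation by a model).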

4. Experimental Results

Experiments on the Berkeley Function-Calling Leaderboard v3 (BFCLv3) demonstrate significant gains:

Model                          Multi-Turn   Single-Turn
Qwen3-4B-Inst-2507 (base)      15.75        78.19
+ FunReason-MT (SFT)           46.90        81.97
+ FunReason-MT (RL)            56.50        85.02

On the out-of-distribution BFCLv4 (Web Search, Memory tasks):

Model                  Base    + FunReason-MT (RL)
Qwen3-4B-Inst-2507     8.85    15.10

FunReason-MT (RL) outperforms open and closed-source baselines (GPT-5, Claude-Sonnet-4) in multi-turn FC and ranks first among models of comparable size. The largest reported accuracy gain occurs in the Multi-Turn regime, demonstrating that the guided, top-down sampling and correction loop is effective for complex tool-use scenarios (Xu et al., 28 Oct 2025).

5. Analysis, Strengths, and Limitations

FunReason-MT achieves its robustness via:

  • Explicit modelling of tool interdependencies using the API graph, enabling targeted coverage of challenging, compositional use cases;
  • Synthesis of high-level, composed tools to ensure the model learns and applies abstraction;
  • A critic-driven, iterative CoT self-correction loop, wherein only logically validated trajectories are included for training.

Limitations include reliance on simulated environments with predefined APIs and significant computational costs for iterative validation and correction. No formal statistical significance tests are reported for the gains in the original evaluations. The framework is extensible beyond function calling, with plausible extensions to multimodal task interfaces (GUI agents), online RL with live environment feedback, and new domains via recalibration or more granular scoring (Xu et al., 28 Oct 2025).

6. Potential Extensions and Connections

Remedy-R (Tan et al., 21 Dec 2025) discusses possible future integrations with FunReason-MT, including interactive evaluations (sub-score querying), multi-agent debates, discourse-level/structured scoring, and application to other generative tasks like summarization or style transfer ("FunReason-Summ" etc.). A plausible implication is that FunReason-MT's architecture and data curation methodologies may serve as a foundation for agentic evaluation and self-improvement pipelines in broader LLM ecosystems.

7. Significance in the Context of Agentic, Tool-Grounded LLMs

FunReason-MT establishes both practical benchmark datasets and a reproducible methodology for enabling agentic LLMs to robustly acquire, compose, and reason over multi-turn tool-use tasks. It demonstrates, for modest model sizes, that strategic data synthesis, per-segment and per-task training objectives, and iterative self-correction can bridge much of the gap to large closed-source models on emergent agentic capabilities. This suggests that future LLM agents requiring robust, compositional reasoning in real-world, tool-rich contexts will benefit from FunReason-MT’s principles and workflow (Hao et al., 26 May 2025, Xu et al., 28 Oct 2025).
