FunReason-MT: Multi-Turn Function Calling
- FunReason-MT is a data synthesis and training framework designed to overcome the complexity of multi-turn function calling by generating realistic, structured tool-use trajectories.
- The framework leverages an API Relation Graph and an advanced tool-query synthesis method to enforce dependencies and modular tool use in various tasks.
- It incorporates a critic-driven, iterative chain-of-thought loop and per-task loss balancing, yielding significant performance gains in multi-turn scenarios.
FunReason-MT is a data synthesis and training framework for LLMs, specifically designed to overcome the complexity barrier in multi-turn function calling (FC) and to enable robust multi-task reasoning and tool use. It generalizes the FunReason paradigm by constructing realistic, deeply structured multi-turn tool-use trajectories and by providing a per-task, per-segment loss balancing scheme that can be scaled to diverse tasks, including API invocation, question answering, and summarization (Hao et al., 26 May 2025, Xu et al., 28 Oct 2025).
1. Background: The Challenge of Multi-Turn Function Calling
In contemporary LLM-based agents, function calling—the ability to interface with and invoke external APIs or tools—is critical for solving complex real-world problems. Earlier frameworks focused on single-turn FC, where the model produces a function call and (optionally) a chain of thought (CoT) in response to a one-shot query. However, most real-world tasks require multi-turn tool use with persistent state, dependencies among tool invocations, and context-dependent reasoning.
In this multi-turn scenario, each trajectory is represented as

$$\tau = \big(s_0,\,(q_1, c_1, s_1),\,(q_2, c_2, s_2),\,\ldots,\,(q_T, c_T, s_T)\big),$$

where $s_t$ is the environment state after $t$ turns, $q_t$ is a (user or agent) query, and $c_t$ is the corresponding function call. Existing data synthesis strategies—random environment sampling and multi-agent role-play (MAS)—rarely generate the complex, long-tail, logically dependent test cases needed for robust FC. Consequently, models exhibit:
- Lack of targeted model training (poor coverage of difficult or key tools);
- Poor tool architecture isolation (failure to induce and compose modular subtools);
- Fragile multi-turn logical dependency handling (Xu et al., 28 Oct 2025).
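The trajectory structure described above can be sketched as a small data model. This is an illustrative representation only; the class and field names are not from the FunReason-MT codebase.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    query: str        # q_t: user or agent query at turn t
    call: dict        # c_t: function call, e.g. {"name": ..., "args": {...}}
    next_state: dict  # s_t: environment state after executing c_t

@dataclass
class Trajectory:
    initial_state: dict                       # s_0
    turns: list[Turn] = field(default_factory=list)

    def append(self, query: str, call: dict, next_state: dict) -> None:
        """Extend the trajectory by one (q_t, c_t, s_t) triple."""
        self.turns.append(Turn(query, call, next_state))

# Toy one-turn trajectory over a file-system-like environment state.
traj = Trajectory(initial_state={"files": []})
traj.append(
    "Create report.txt",
    {"name": "touch", "args": {"path": "report.txt"}},
    {"files": ["report.txt"]},
)
```

Persistent state lives in `next_state`, which is what distinguishes this from a flat list of independent single-turn calls.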
2. FunReason-MT Framework Design
FunReason-MT addresses these deficiencies via a top-down, three-stage data generation pipeline and an associated multi-task training objective.
2.1 Environment–API Graph Interactions
FunReason-MT constructs an explicit API Relation Graph,

$$G = (V, E, P),$$

where $V$ is the set of basic tools, $E$ encodes prerequisite dependencies between tools, and $P$ specifies parameter schemas.
During synthesis, the system maintains the set of already-called tools $C$, and only exposes as legal choices those tools whose prerequisites under $E$ are all satisfied. Sampling is directed: if the aim is to guarantee that a target tool $v^{*}$ is eventually used, the mechanism prioritizes actions that minimize the graph distance to $v^{*}$.
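The prerequisite-gated, target-directed sampling described above can be sketched as follows. The tool names and the BFS distance heuristic are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

# prereqs[v] lists tools that must already have been called before v is legal.
prereqs = {
    "login": [],
    "search": ["login"],
    "add_to_cart": ["search"],
    "checkout": ["add_to_cart"],
}

def legal_tools(called: set[str]) -> set[str]:
    """Tools not yet called whose prerequisites are all satisfied."""
    return {v for v, pre in prereqs.items()
            if v not in called and all(p in called for p in pre)}

def distance_to_target(start: str, target: str) -> float:
    """BFS distance from `start` to `target` along prerequisite edges."""
    # succ[v] = tools that list v as a prerequisite (forward edges).
    succ = {v: [u for u, pre in prereqs.items() if v in pre] for v in prereqs}
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return depth
        for nxt in succ[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return float("inf")

def pick_next(called: set[str], target: str) -> str:
    """Directed sampling: among legal tools, pick the one closest to the target."""
    return min(legal_tools(called), key=lambda v: distance_to_target(v, target))
```

For example, with nothing called yet and `checkout` as the target, only `login` is legal, so it is selected first; each subsequent pick unlocks the next link in the dependency chain.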
2.2 Advanced Tool–Query Synthesis
From a primitive multi-step tool-use trace, FunReason-MT synthesizes an "advanced tool" abstraction that encapsulates the composite operation $f_{\mathrm{adv}} = f_k \circ \cdots \circ f_1$, and generates a difficult query that forces invocation of the entire composed operation. This guarantees that the dataset includes scenarios with challenging logical jumps and cross-tool dependencies in the target domain.
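The advanced-tool abstraction amounts to folding a primitive trace into a single composed callable. A minimal sketch, with toy primitives over a dict-valued environment state (the function names are hypothetical):

```python
def make_advanced_tool(primitives):
    """Compose primitive tools (each: state -> state) into one advanced tool,
    i.e. f_adv = f_k ∘ ... ∘ f_1 applied left to right over the trace."""
    def advanced(state):
        for f in primitives:
            state = f(state)
        return state
    return advanced

# Toy primitives standing in for API steps in a multi-step trace.
fetch = lambda s: {**s, "raw": [3, 1, 2]}            # retrieve data
sort_ = lambda s: {**s, "sorted": sorted(s["raw"])}  # transform it
summ  = lambda s: {**s, "total": sum(s["sorted"])}   # aggregate it

get_sorted_total = make_advanced_tool([fetch, sort_, summ])
result = get_sorted_total({})
```

A query synthesized against `get_sorted_total` cannot be answered by any single primitive, which is exactly the cross-tool dependency the section describes.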
2.3 Guided Iterative Chain-of-Thought Generation
Given these challenging queries, the framework incorporates a critic-driven, self-correcting generation loop. A Reasoning Agent proposes an initial CoT and function call; a validation step compares the output against the ground truth. If a failure occurs, a Critiquing Agent diagnoses the error, and a new iteration refines the reasoning chain and output. Only successful, fully validated traces are included in the final multi-turn dataset.
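The propose–validate–critique–retry loop above can be sketched as follows. The `reason`, `validate`, and `critique` callables are stand-ins for the Reasoning Agent, the ground-truth check, and the Critiquing Agent; the control flow, not the agents, is the point.

```python
def guided_generation(query, ground_truth, reason, validate, critique, max_iters=3):
    """Return a validated (cot, call) trace, or None if no iteration succeeds.
    Only fully validated traces would enter the training dataset."""
    feedback = None
    for _ in range(max_iters):
        cot, call = reason(query, feedback)   # Reasoning Agent proposal
        if validate(call, ground_truth):      # compare output against ground truth
            return cot, call
        feedback = critique(query, cot, call) # Critiquing Agent diagnoses the error
    return None

# Toy agents: the first attempt fails; the critique steers the retry to success.
def toy_reason(query, feedback):
    if feedback is None:
        return "first guess", "bad_call"
    return "revised reasoning using feedback", "good_call"

trace = guided_generation(
    "q", "good_call",
    reason=toy_reason,
    validate=lambda call, gt: call == gt,
    critique=lambda q, cot, call: f"{call} does not match ground truth",
)
```

Traces that exhaust `max_iters` without validating return `None` and are discarded, mirroring the "only successful, fully validated traces" filter.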
3. Multi-Task Extension: Loss Formulation and Training Loop
FunReason introduces a Self-Refinement Multiscale Loss (SRML) designed to balance the cross-entropy contributions of the reasoning (CoT) and function-call (FC) segments:

$$\mathcal{L}_{\mathrm{SRML}} = \lambda_{\mathrm{CoT}}\,\mathcal{L}_{\mathrm{CoT}} + \lambda_{\mathrm{FC}}\,\mathcal{L}_{\mathrm{FC}}.$$

This formulation can be generalized to a multi-task regime with per-task, per-segment coefficients,

$$\mathcal{L} = \sum_{t} w_t \sum_{s} \lambda_{t,s}\,\mathcal{L}_{t,s},$$

where tasks $t$ may encompass different output formats (e.g., API call, free text, summary), $w_t$ are task weights, and $\lambda_{t,s}$ are per-task balancing coefficients over segments $s$ (Hao et al., 26 May 2025).
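The per-task, per-segment weighting can be sketched numerically. Here the segment losses are plain numbers standing in for per-token cross-entropy terms; the task and segment names are illustrative.

```python
def multitask_loss(segment_losses, task_weights, seg_coeffs):
    """Weighted sum over tasks t and segments s of w_t * lambda_{t,s} * L_{t,s}.

    segment_losses[t][s] -- cross-entropy of segment s (e.g. "cot", "fc") of task t
    task_weights[t]      -- task weight w_t
    seg_coeffs[t][s]     -- per-task balancing coefficient lambda_{t,s}
    """
    total = 0.0
    for task, segs in segment_losses.items():
        total += task_weights[task] * sum(
            seg_coeffs[task][s] * loss for s, loss in segs.items()
        )
    return total

# Two tasks with different output formats: function calling (CoT + FC segments)
# and free-text QA (a single text segment).
loss = multitask_loss(
    segment_losses={"fc": {"cot": 2.0, "fc": 1.0}, "qa": {"text": 0.5}},
    task_weights={"fc": 1.0, "qa": 0.5},
    seg_coeffs={"fc": {"cot": 0.3, "fc": 0.7}, "qa": {"text": 1.0}},
)
```

Setting a single task's coefficients (e.g. down-weighting CoT relative to FC) recovers the single-task SRML balance as a special case.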
Each data domain is processed using a specialized Function Call Data Refinement pipeline (FCDR), which enforces output correctness and format as per FunReason's five-stage validation scheme (function-call classification, tool match, chain-of-thought validation, parameter inspection, formatting).
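The five-stage FCDR validation can be sketched as a short ordered pipeline. The stage predicates below are toy stand-ins for the paper's checks (the names follow the five stages listed above; the logic is assumed, not taken from the implementation).

```python
def fcdr_validate(example, known_tools):
    """Run the five FCDR stages in order; return (passed, failed_stage).

    known_tools maps tool name -> list of required parameter names.
    """
    stages = [
        # 1. function-call classification: call presence matches the label
        ("classification", lambda e: e["needs_call"] == (e["call"] is not None)),
        # 2. tool match: the called tool exists
        ("tool_match", lambda e: e["call"] is None
                                 or e["call"]["name"] in known_tools),
        # 3. chain-of-thought validation: non-empty reasoning (toy check)
        ("cot_validation", lambda e: bool(e["cot"].strip())),
        # 4. parameter inspection: argument names match the schema
        ("param_check", lambda e: e["call"] is None
                                  or set(e["call"]["args"])
                                  == set(known_tools.get(e["call"]["name"], []))),
        # 5. formatting: call is structured output (dict) or absent
        ("formatting", lambda e: isinstance(e["call"], (dict, type(None)))),
    ]
    for name, check in stages:
        if not check(example):
            return False, name
    return True, None

tools = {"get_weather": ["city"]}
ok, failed = fcdr_validate(
    {"needs_call": True,
     "cot": "The user asks about weather, so call get_weather.",
     "call": {"name": "get_weather", "args": {"city": "Paris"}}},
    tools,
)
```

Running the stages in a fixed order also yields a diagnostic label (`failed_stage`) identifying the earliest violated check, which is useful for targeted correction.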
The self-refinement loop further enhances quality: after initial SFT with the SRML objective, new examples are generated on held-out data, validated and corrected with FCDR, and used to continue fine-tuning.
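The loop structure itself is simple: train, generate on held-out prompts, keep only validated examples, train again. A minimal sketch in which `train`, `generate`, and `validates` are hypothetical stand-ins for SFT, model sampling, and the FCDR filter:

```python
def self_refine(model, seed_data, held_out, train, generate, validates, rounds=2):
    """Iterated SFT: seed training, then rounds of generate -> filter -> retrain."""
    model = train(model, seed_data)                # initial SFT pass
    for _ in range(rounds):
        candidates = [generate(model, q) for q in held_out]
        refined = [ex for ex in candidates if validates(ex)]  # FCDR-style filter
        if refined:
            model = train(model, refined)          # continue fine-tuning
    return model

# Toy instantiation: the "model" is just the list of examples it has absorbed,
# and validation accepts only answers to the first held-out query.
final = self_refine(
    model=[], seed_data=["seed"], held_out=["q1", "q2"],
    train=lambda m, data: m + list(data),
    generate=lambda m, q: f"ans:{q}",
    validates=lambda ex: ex.endswith("q1"),
)
```

The filtering step is what keeps the loop from amplifying its own errors: only outputs that pass validation feed back into training.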
4. Experimental Results
Experiments on the Berkeley Function-Calling Leaderboard v3 (BFCLv3) demonstrate significant gains:
| Model | Multi-Turn | Single-Turn |
|---|---|---|
| Qwen3-4B-Inst-2507 (base) | 15.75 | 78.19 |
| + FunReason-MT (SFT) | 46.90 | 81.97 |
| + FunReason-MT (RL) | 56.50 | 85.02 |
On the out-of-distribution BFCLv4 (Web Search, Memory tasks):
| Model | Base | + FunReason-MT (RL) |
|---|---|---|
| Qwen3-4B-Inst-2507 | 8.85 | 15.10 |
FunReason-MT (RL) outperforms both open- and closed-source baselines (GPT-5, Claude-Sonnet-4) in multi-turn FC and ranks first among models of comparable size. The largest reported accuracy gain occurs in the multi-turn regime, demonstrating that the guided, top-down sampling and correction loop is effective for complex tool-use scenarios (Xu et al., 28 Oct 2025).
5. Analysis, Strengths, and Limitations
FunReason-MT achieves its robustness via:
- Explicit modeling of tool interdependencies using the API graph, enabling targeted coverage of challenging, compositional use cases;
- Synthesis of high-level, composed tools to ensure the model learns and applies abstraction;
- A critic-driven, iterative CoT self-correction loop, wherein only logically validated trajectories are included for training.
Limitations include reliance on simulated environments with predefined APIs and significant computational costs for iterative validation and correction. No formal statistical significance tests are reported for the gains in the original evaluations. The framework is extensible beyond function calling, with plausible extensions to multimodal task interfaces (GUI agents), online RL with live environment feedback, and new domains via recalibration or more granular scoring (Xu et al., 28 Oct 2025).
6. Potential Extensions and Connections
Remedy-R (Tan et al., 21 Dec 2025) discusses possible future integrations with FunReason-MT, including interactive evaluations (sub-score querying), multi-agent debates, discourse-level/structured scoring, and application to other generative tasks like summarization or style transfer ("FunReason-Summ" etc.). A plausible implication is that FunReason-MT's architecture and data curation methodologies may serve as a foundation for agentic evaluation and self-improvement pipelines in broader LLM ecosystems.
7. Significance in the Context of Agentic, Tool-Grounded LLMs
FunReason-MT establishes both practical benchmark datasets and a reproducible methodology for enabling agentic LLMs to robustly acquire, compose, and reason over multi-turn tool-use tasks. It demonstrates, for modest model sizes, that strategic data synthesis, per-segment and per-task training objectives, and iterative self-correction can bridge much of the gap to large closed-source models on emergent agentic capabilities. This suggests that future LLM agents requiring robust, compositional reasoning in real-world, tool-rich contexts will benefit from FunReason-MT’s principles and workflow (Hao et al., 26 May 2025, Xu et al., 28 Oct 2025).