FunReason-MT: Multi-Turn Function Calling
- FunReason-MT is a data synthesis and training framework designed to overcome the complexity of multi-turn function calling by generating realistic, structured tool-use trajectories.
- The framework leverages an API Relation Graph and an advanced tool-query synthesis method to enforce dependencies and modular tool use in various tasks.
- It incorporates a critic-driven, iterative chain-of-thought loop and per-task loss balancing, yielding significant performance gains in multi-turn scenarios.
FunReason-MT is a data synthesis and training framework for LLMs, specifically designed to overcome the complexity barrier in multi-turn function calling (FC) and to enable robust multi-task reasoning and tool use. It generalizes the FunReason paradigm by constructing realistic, deeply structured multi-turn tool-use trajectories and by providing a per-task, per-segment loss balancing scheme that can be scaled to diverse tasks, including API invocation, question answering, and summarization (Hao et al., 26 May 2025, Xu et al., 28 Oct 2025).
1. Background: The Challenge of Multi-Turn Function Calling
In contemporary LLM-based agents, function calling—the ability to interface with and invoke external APIs or tools—is critical for solving complex real-world problems. Earlier frameworks focused on single-turn FC, where the model produces a function call and (optionally) a chain of thought (CoT) in response to a one-shot query. However, most real-world tasks require multi-turn tool use with persistent state, dependencies among tool invocations, and context-dependent reasoning.
In this multi-turn scenario, each trajectory is represented as

$$\tau = \big(s_0,\,(q_1, c_1, s_1),\,(q_2, c_2, s_2),\,\ldots,\,(q_T, c_T, s_T)\big),$$

where $s_t$ is the environment state after $t$ turns, $q_t$ is a (user or agent) query, and $c_t$ is the corresponding function call. Existing data synthesis strategies—random environment sampling and multi-agent role-play (MAS)—rarely generate the complex, long-tail, logically dependent test cases needed for robust FC. Consequently, models exhibit:
- Lack of targeted model training (poor coverage of difficult or key tools);
- Poor tool architecture isolation (failure to induce and compose modular subtools);
- Fragile multi-turn logical dependency handling (Xu et al., 28 Oct 2025).
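The trajectory structure described above can be sketched as a small data model. This is an illustrative representation only; the class and field names are not from the FunReason-MT codebase.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    query: str        # q_t: user or agent query at turn t
    call: dict        # c_t: function call, e.g. {"name": ..., "args": {...}}
    next_state: dict  # s_t: environment state after executing c_t

@dataclass
class Trajectory:
    initial_state: dict                       # s_0
    turns: list[Turn] = field(default_factory=list)

    def append(self, query: str, call: dict, next_state: dict) -> None:
        """Extend the trajectory by one (q_t, c_t, s_t) triple."""
        self.turns.append(Turn(query, call, next_state))

# Toy one-turn trajectory over a file-system-like environment state.
traj = Trajectory(initial_state={"files": []})
traj.append(
    "Create report.txt",
    {"name": "touch", "args": {"path": "report.txt"}},
    {"files": ["report.txt"]},
)
```

Persistent state lives in `next_state`, which is what distinguishes this from a flat list of independent single-turn calls.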
2. FunReason-MT Framework Design
FunReason-MT addresses these deficiencies via a top-down, three-stage data generation pipeline and an associated multi-task training objective.
2.1 Environment–API Graph Interactions
FunReason-MT constructs an explicit API Relation Graph,

$$G = (V, E, P),$$

where $V$ is the set of basic tools, $E$ encodes prerequisite dependencies between tools, and $P$ specifies parameter schemas.
During synthesis, the system maintains the set of already-called tools $C$, and only exposes as legal choices those tools whose prerequisites under $E$ are all satisfied. Sampling is directed: if the aim is to guarantee that a target tool $v^{*}$ is eventually used, the mechanism prioritizes actions that minimize the graph distance to $v^{*}$.
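The prerequisite-gated, target-directed sampling described above can be sketched as follows. The tool names and the BFS distance heuristic are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

# prereqs[v] lists tools that must already have been called before v is legal.
prereqs = {
    "login": [],
    "search": ["login"],
    "add_to_cart": ["search"],
    "checkout": ["add_to_cart"],
}

def legal_tools(called: set[str]) -> set[str]:
    """Tools not yet called whose prerequisites are all satisfied."""
    return {v for v, pre in prereqs.items()
            if v not in called and all(p in called for p in pre)}

def distance_to_target(start: str, target: str) -> float:
    """BFS distance from `start` to `target` along prerequisite edges."""
    # succ[v] = tools that list v as a prerequisite (forward edges).
    succ = {v: [u for u, pre in prereqs.items() if v in pre] for v in prereqs}
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return depth
        for nxt in succ[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return float("inf")

def pick_next(called: set[str], target: str) -> str:
    """Directed sampling: among legal tools, pick the one closest to the target."""
    return min(legal_tools(called), key=lambda v: distance_to_target(v, target))
```

For example, with nothing called yet and `checkout` as the target, only `login` is legal, so it is selected first; each subsequent pick unlocks the next link in the dependency chain.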
2.2 Advanced Tool–Query Synthesis
From a primitive multi-step tool-use trace, FunReason-MT synthesizes an "advanced tool" abstraction that encapsulates the composite operation $f_{\mathrm{adv}} = f_k \circ \cdots \circ f_1$, and generates a difficult query that forces invocation of the entire composed operation. This guarantees that the dataset includes scenarios with challenging logical jumps and cross-tool dependencies in the target domain.
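The advanced-tool abstraction amounts to folding a primitive trace into a single composed callable. A minimal sketch, with toy primitives over a dict-valued environment state (the function names are hypothetical):

```python
def make_advanced_tool(primitives):
    """Compose primitive tools (each: state -> state) into one advanced tool,
    i.e. f_adv = f_k ∘ ... ∘ f_1 applied left to right over the trace."""
    def advanced(state):
        for f in primitives:
            state = f(state)
        return state
    return advanced

# Toy primitives standing in for API steps in a multi-step trace.
fetch = lambda s: {**s, "raw": [3, 1, 2]}            # retrieve data
sort_ = lambda s: {**s, "sorted": sorted(s["raw"])}  # transform it
summ  = lambda s: {**s, "total": sum(s["sorted"])}   # aggregate it

get_sorted_total = make_advanced_tool([fetch, sort_, summ])
result = get_sorted_total({})
```

A query synthesized against `get_sorted_total` cannot be answered by any single primitive, which is exactly the cross-tool dependency the section describes.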
2.3 Guided Iterative Chain-of-Thought Generation
Given these challenging queries, the framework incorporates a critic-driven, self-correcting generation loop. A Reasoning Agent proposes an initial CoT and function call; a validation step compares the output against the ground truth. If a failure occurs, a Critiquing Agent diagnoses the error, and a new iteration refines the reasoning chain and output. Only successful, fully validated traces are included in the final multi-turn dataset.
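The propose–validate–critique–retry loop above can be sketched as follows. The `reason`, `validate`, and `critique` callables are stand-ins for the Reasoning Agent, the ground-truth check, and the Critiquing Agent; the control flow, not the agents, is the point.

```python
def guided_generation(query, ground_truth, reason, validate, critique, max_iters=3):
    """Return a validated (cot, call) trace, or None if no iteration succeeds.
    Only fully validated traces would enter the training dataset."""
    feedback = None
    for _ in range(max_iters):
        cot, call = reason(query, feedback)   # Reasoning Agent proposal
        if validate(call, ground_truth):      # compare output against ground truth
            return cot, call
        feedback = critique(query, cot, call) # Critiquing Agent diagnoses the error
    return None

# Toy agents: the first attempt fails; the critique steers the retry to success.
def toy_reason(query, feedback):
    if feedback is None:
        return "first guess", "bad_call"
    return "revised reasoning using feedback", "good_call"

trace = guided_generation(
    "q", "good_call",
    reason=toy_reason,
    validate=lambda call, gt: call == gt,
    critique=lambda q, cot, call: f"{call} does not match ground truth",
)
```

Traces that exhaust `max_iters` without validating return `None` and are discarded, mirroring the "only successful, fully validated traces" filter.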
3. Multi-Task Extension: Loss Formulation and Training Loop
FunReason introduces a Self-Refinement Multiscale Loss (SRML) designed to balance the cross-entropy contributions of the reasoning (CoT) and function-call (FC) segments:

$$\mathcal{L}_{\mathrm{SRML}} = \lambda_{\mathrm{CoT}}\,\mathcal{L}_{\mathrm{CoT}} + \lambda_{\mathrm{FC}}\,\mathcal{L}_{\mathrm{FC}}.$$

This formulation can be generalized to a multi-task regime with per-task, per-segment coefficients,

$$\mathcal{L} = \sum_{t} w_t \sum_{s} \lambda_{t,s}\,\mathcal{L}_{t,s},$$

where tasks $t$ may encompass different output formats (e.g., API call, free text, summary), $w_t$ are task weights, and $\lambda_{t,s}$ are per-task balancing coefficients over segments $s$ (Hao et al., 26 May 2025).
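The per-task, per-segment weighting can be sketched numerically. Here the segment losses are plain numbers standing in for per-token cross-entropy terms; the task and segment names are illustrative.

```python
def multitask_loss(segment_losses, task_weights, seg_coeffs):
    """Weighted sum over tasks t and segments s of w_t * lambda_{t,s} * L_{t,s}.

    segment_losses[t][s] -- cross-entropy of segment s (e.g. "cot", "fc") of task t
    task_weights[t]      -- task weight w_t
    seg_coeffs[t][s]     -- per-task balancing coefficient lambda_{t,s}
    """
    total = 0.0
    for task, segs in segment_losses.items():
        total += task_weights[task] * sum(
            seg_coeffs[task][s] * loss for s, loss in segs.items()
        )
    return total

# Two tasks with different output formats: function calling (CoT + FC segments)
# and free-text QA (a single text segment).
loss = multitask_loss(
    segment_losses={"fc": {"cot": 2.0, "fc": 1.0}, "qa": {"text": 0.5}},
    task_weights={"fc": 1.0, "qa": 0.5},
    seg_coeffs={"fc": {"cot": 0.3, "fc": 0.7}, "qa": {"text": 1.0}},
)
```

Setting a single task's coefficients (e.g. down-weighting CoT relative to FC) recovers the single-task SRML balance as a special case.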
Each data domain is processed using a specialized Function Call Data Refinement pipeline (FCDR), which enforces output correctness and format as per FunReason's five-stage validation scheme (function-call classification, tool match, chain-of-thought validation, parameter inspection, formatting).
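The five-stage FCDR validation can be sketched as a short ordered pipeline. The stage predicates below are toy stand-ins for the paper's checks (the names follow the five stages listed above; the logic is assumed, not taken from the implementation).

```python
def fcdr_validate(example, known_tools):
    """Run the five FCDR stages in order; return (passed, failed_stage).

    known_tools maps tool name -> list of required parameter names.
    """
    stages = [
        # 1. function-call classification: call presence matches the label
        ("classification", lambda e: e["needs_call"] == (e["call"] is not None)),
        # 2. tool match: the called tool exists
        ("tool_match", lambda e: e["call"] is None
                                 or e["call"]["name"] in known_tools),
        # 3. chain-of-thought validation: non-empty reasoning (toy check)
        ("cot_validation", lambda e: bool(e["cot"].strip())),
        # 4. parameter inspection: argument names match the schema
        ("param_check", lambda e: e["call"] is None
                                  or set(e["call"]["args"])
                                  == set(known_tools.get(e["call"]["name"], []))),
        # 5. formatting: call is structured output (dict) or absent
        ("formatting", lambda e: isinstance(e["call"], (dict, type(None)))),
    ]
    for name, check in stages:
        if not check(example):
            return False, name
    return True, None

tools = {"get_weather": ["city"]}
ok, failed = fcdr_validate(
    {"needs_call": True,
     "cot": "The user asks about weather, so call get_weather.",
     "call": {"name": "get_weather", "args": {"city": "Paris"}}},
    tools,
)
```

Running the stages in a fixed order also yields a diagnostic label (`failed_stage`) identifying the earliest violated check, which is useful for targeted correction.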
The self-refinement loop further enhances quality: after initial SFT with the SRML objective, new examples are generated on held-out data, validated and corrected with FCDR, and used to continue fine-tuning.
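The loop structure itself is simple: train, generate on held-out prompts, keep only validated examples, train again. A minimal sketch in which `train`, `generate`, and `validates` are hypothetical stand-ins for SFT, model sampling, and the FCDR filter:

```python
def self_refine(model, seed_data, held_out, train, generate, validates, rounds=2):
    """Iterated SFT: seed training, then rounds of generate -> filter -> retrain."""
    model = train(model, seed_data)                # initial SFT pass
    for _ in range(rounds):
        candidates = [generate(model, q) for q in held_out]
        refined = [ex for ex in candidates if validates(ex)]  # FCDR-style filter
        if refined:
            model = train(model, refined)          # continue fine-tuning
    return model

# Toy instantiation: the "model" is just the list of examples it has absorbed,
# and validation accepts only answers to the first held-out query.
final = self_refine(
    model=[], seed_data=["seed"], held_out=["q1", "q2"],
    train=lambda m, data: m + list(data),
    generate=lambda m, q: f"ans:{q}",
    validates=lambda ex: ex.endswith("q1"),
)
```

The filtering step is what keeps the loop from amplifying its own errors: only outputs that pass validation feed back into training.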
4. Experimental Results
Experiments on the Berkeley Function-Calling Leaderboard v3 (BFCLv3) demonstrate significant gains:
| Model | Multi-Turn | Single-Turn |
|---|---|---|
| Qwen3-4B-Inst-2507 (base) | 15.75 | 78.19 |
| + FunReason-MT (SFT) | 46.90 | 81.97 |
| + FunReason-MT (RL) | 56.50 | 85.02 |
On the out-of-distribution BFCLv4 (Web Search, Memory tasks):
| Model | Base | + FunReason-MT (RL) |
|---|---|---|
| Qwen3-4B-Inst-2507 | 8.85 | 15.10 |
FunReason-MT (RL) outperforms both open- and closed-source baselines (GPT-5, Claude-Sonnet-4) in multi-turn FC and ranks first among models of comparable size. The largest reported accuracy gain occurs in the multi-turn regime, demonstrating that the guided, top-down sampling and correction loop is effective for complex tool-use scenarios (Xu et al., 28 Oct 2025).
5. Analysis, Strengths, and Limitations
FunReason-MT achieves its robustness via:
- Explicit modeling of tool interdependencies using the API graph, enabling targeted coverage of challenging, compositional use cases;
- Synthesis of high-level, composed tools to ensure the model learns and applies abstraction;
- A critic-driven, iterative CoT self-correction loop, wherein only logically validated trajectories are included for training.
Limitations include reliance on simulated environments with predefined APIs and significant computational costs for iterative validation and correction. No formal statistical significance tests are reported for the gains in the original evaluations. The framework is extensible beyond function calling, with plausible extensions to multimodal task interfaces (GUI agents), online RL with live environment feedback, and new domains via recalibration or more granular scoring (Xu et al., 28 Oct 2025).
6. Potential Extensions and Connections
Remedy-R (Tan et al., 21 Dec 2025) discusses possible future integrations with FunReason-MT, including interactive evaluations (sub-score querying), multi-agent debates, discourse-level/structured scoring, and application to other generative tasks like summarization or style transfer ("FunReason-Summ" etc.). A plausible implication is that FunReason-MT's architecture and data curation methodologies may serve as a foundation for agentic evaluation and self-improvement pipelines in broader LLM ecosystems.
7. Significance in the Context of Agentic, Tool-Grounded LLMs
FunReason-MT establishes both practical benchmark datasets and a reproducible methodology for enabling agentic LLMs to robustly acquire, compose, and reason over multi-turn tool-use tasks. It demonstrates, for modest model sizes, that strategic data synthesis, per-segment and per-task training objectives, and iterative self-correction can bridge much of the gap to large closed-source models on emergent agentic capabilities. This suggests that future LLM agents requiring robust, compositional reasoning in real-world, tool-rich contexts will benefit from FunReason-MT’s principles and workflow (Hao et al., 26 May 2025, Xu et al., 28 Oct 2025).