Synthetic Function-Calling Dataset
- Synthetic function-calling datasets are artificial corpora that systematically map natural language queries into structured, machine-readable function call representations.
- They employ varied methodologies, including instruction re-purposing, LLM-driven generation, and multi-agent simulation, to cover diverse API paradigms, pairing generation with rigorous verification.
- These datasets enhance model robustness and multi-turn dialog capability by enabling precise, scalable training through compositional validation and augmentation techniques.
A synthetic function-calling dataset is an artificial corpus specifically crafted to train or evaluate LLMs on the mapping between natural-language user inputs and precise, machine-readable function call representations, typically within agentic or tool-augmented AI paradigms. These datasets act as surrogates for scarce or privacy-sensitive real user–API logs, enabling systematic coverage, fine-grained complexity control, and rigorous verification—attributes often lacking in naturally occurring data. The following sections survey the main approaches, design axes, representative methodologies, and empirical impacts of recent synthetic function-calling datasets.
1. Methodologies for Synthetic Function-Calling Data Generation
Synthetic function-calling datasets employ diverse methodologies, ranging from simple instruction reformatting of legacy corpora to sophisticated, multi-agent simulation with explicit verification. Key approaches include:
- Instruction Re-purposing: Granite-FunctionCalling exemplifies the simplest approach, repurposing existing dialog/semantic parsing datasets (e.g., MultiWOZ, SNIPS) by re-instructing each example for function-calling and unifying the output schema to a structured JSON format. No new samples are generated; rather, instances are mapped into seven granular tasks such as nested calling, chaining, parallel calls, and parameter extraction (Abdelaziz et al., 2024).
- Automated LLM-Routed Synthesis: Pipelines such as APIGen (Liu et al., 2024) sample real executable APIs and use large LLMs to generate candidate queries, enforcing correctness via three-stage format, execution, and semantic checks. Parallel efforts such as FunRL (Hao et al., 7 Aug 2025) introduce iterative LLM-based evaluation and abstract syntax tree (AST) validation, discarding noisy or ill-formed samples.
- Multi-Agent Simulation and Environment Modeling: Advanced frameworks such as FunReason-MT (Xu et al., 28 Oct 2025), BUTTON (Chen et al., 2024), DICE-Bench (Jang et al., 28 Jun 2025), and ToolACE (Liu et al., 2024) utilize simulated dialogues among virtual users, agents, and tool APIs—often controlled by a scenario graph or API-dependency graph—to instantiate complex, multi-turn, or multi-party scenarios. Explicit graph-based orchestration enables logical dependency, long-context chaining, and fine-grained difficulty control.
- Domain- or Modality-Specific Generation: Datasets such as mind_call target specific application domains—in this case, mental health use cases grounded in wearable sensor data—by enumerating domain-relevant function templates and exhaustively generating queries across explicit, implicit, behavioral, symptom-based, or metaphorical linguistic categories (Shafi et al., 11 Jan 2026).
- Augmentation for Model Robustness: Hammer (Lin et al., 2024) and similar efforts introduce algorithmic irrelevance and masking augmentation, generating synthetic “no relevant call” examples and randomizing function or parameter names at train time to increase out-of-vocabulary robustness.
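To make the masking idea concrete, the following is a minimal sketch over an xLAM-style record (a query, a `tools` list, and gold `answers`, as illustrated in Section 2); the helper names are illustrative and not taken from the Hammer implementation.

```python
import copy
import random
import string

def _random_name(prefix: str) -> str:
    """Produce an opaque placeholder name such as 'func_a3x9'."""
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=4))
    return f"{prefix}_{suffix}"

def mask_example(example: dict, p: float = 0.5) -> dict:
    """Randomly rename a subset of functions in one training record, rewriting
    the gold calls consistently so the model must rely on descriptions rather
    than memorized function names. Parameter-name masking works analogously."""
    ex = copy.deepcopy(example)
    renames = {}
    for tool in ex["tools"]:
        if random.random() < p:
            new_name = _random_name("func")
            renames[tool["name"]] = new_name
            tool["name"] = new_name
    for call in ex["answers"]:
        call["name"] = renames.get(call["name"], call["name"])
    return ex
```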
The table below summarizes distinguishing features of selected pipelines:
| Dataset/Pipeline | Generation Mode | Verification / Filtering |
|---|---|---|
| Granite-FunctionCalling | Instruction re-purposing | None beyond schema unification |
| APIGen/xLAM | LLM generation | 3-stage (format, execution, semantic) |
| FunRL | LLM generation, AST eval | Multi-pass LLM+AST filtering |
| FunReason(-MT) | CoT/LLM, iterative graph | Self-refinement, majority voting |
| BUTTON | Multi-agent simulation | Heuristics for task compositionality |
| DICE-Bench | Multi-party simulation | Automated and human dialogue checks |
| ToolACE | LLM evolution+simulation | Dual-layer (rule+LLM) validation |
| Hammer | xLAM + augmentation | Masking; irrelevance synthesis |
| mind_call | Domain enumeration | Manual normalization; chain-of-thought |
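To illustrate the "LLM generation" mode summarized in the table, here is a minimal sketch of an APIGen-style synthesis loop; `api_registry` and `llm` are placeholders for a real API collection and a model client, so this shows the shape of the loop rather than the published pipeline.

```python
import json
import random

PROMPT = """You are given the following API definitions:
{apis}

Write a realistic user request that requires one or more of these APIs, and
return a JSON object with keys "query" and "answers", where "answers" is a
list of function calls, each with "name" and "arguments" fields."""

def synthesize_examples(api_registry, llm, n_samples=1000, apis_per_prompt=3):
    """Sample API subsets, prompt an LLM for (query, calls) pairs, and keep
    only records that parse; execution and semantic checks come later (Section 3)."""
    records = []
    for _ in range(n_samples):
        apis = random.sample(api_registry, k=min(apis_per_prompt, len(api_registry)))
        prompt = PROMPT.format(apis=json.dumps(apis, indent=2))
        raw = llm(prompt)  # assumed: callable returning the model's text output
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # format check: drop unparsable generations
        if not isinstance(record, dict) or "query" not in record or "answers" not in record:
            continue
        record["tools"] = apis
        records.append(record)
    return records
```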
2. Schema, Task Coverage, and Dataset Structures
Synthetic datasets typically standardize both the function call schema and the prompting interface. Key characteristics include:
- Function Definition Schemas: Most datasets use JSON schemas to express function signatures, including name, description, and arguments (typed fields with optional, required, or nested structures). These are injected verbatim as “library” prompts in each example to teach model-to-API grounding (Abdelaziz et al., 2024, Liu et al., 2024, Zeng et al., 2024); an illustrative definition is sketched after the example entry below.
- Canonical Output Formats: Output schemas obligate the model to produce objects such as `{ "name": "function_name", "arguments": { "param1": value, ... } }`, or their equivalents in directed acyclic plan representations, especially for sequential and parallel call compositions (TinyAgent (Erdogan et al., 2024)).
- Granular Task Types: Synthetic datasets often target orthogonal capabilities, including:
- Nested function calling (output of one as argument to another)
- Flat function chaining (ordered, non-nested invocations)
- Parallel/independent calls
- Function/tool selection (from long candidate lists)
- Parameter extraction and slot-filling (from noisy natural input)
- No-call/irrelevance handling (when no candidate fits; Hammer, ToolACE)
- Multi-turn dialogue, with reasoning and logical dependencies (FunReason-MT, BUTTON)
- Domain-specific mapping (mind_call: temporal normalization, category grounding)
- Instruction Styles and Context: Prompt schemas encode both the function candidates and relevant instructions, with system prompts specifying “call the appropriate function in JSON given the following API library…”
Example synthetic (parallel-multiple) entry (Liu et al., 2024):
```json
{
  "query": "Get stock prices for AAPL and GOOGL today; also translate 'Hello' to French.",
  "tools": [ ... ],
  "answers": [
    {"name": "finance.get_stock_price", "arguments": {"symbol": "AAPL", "date": "2024-06-01"}},
    {"name": "finance.get_stock_price", "arguments": {"symbol": "GOOGL", "date": "2024-06-01"}},
    {"name": "ml.translate", "arguments": {"text": "Hello", "to": "fr"}}
  ]
}
```
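The elided `tools` field above holds the function definitions the model conditions on. A hypothetical entry for the first call, written in the common name/description/parameters JSON-schema style (the exact field layout is an assumption for illustration, not copied from the dataset):

```python
# Hypothetical tool definition; only the function name and its two arguments
# are taken from the example above, the schema layout itself is illustrative.
stock_price_tool = {
    "name": "finance.get_stock_price",
    "description": "Return the price of a stock symbol on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "symbol": {"type": "string", "description": "Ticker symbol, e.g. AAPL."},
            "date": {"type": "string", "description": "ISO-8601 date, e.g. 2024-06-01."},
        },
        "required": ["symbol", "date"],
    },
}
```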
3. Verification, Quality Assurance, and Diversity Control
Synthetic function-calling corpora leverage multi-layered verification to maximize data correctness, coverage, and variance:
- Hierarchical Filtering: APIGen (Liu et al., 2024) and FunRL (Hao et al., 7 Aug 2025) use format checking (JSON validity), live execution of candidate function calls (with error trapping), and semantic validation (via LLM “judge” prompts); a minimal sketch of this cascade follows this list. FunRL further analyzes each sample’s AST, discarding any with malformed or mismatched trees.
- Dynamic Augmentation and Masking: Automatic function and parameter name masking encourages reliance on function descriptions over memorization, yielding 4–5× data diversity (ToolPRM (Lin et al., 16 Oct 2025), Hammer (Lin et al., 2024)).
- Compositionality Enforcement: BUTTON (Chen et al., 2024) heuristically ensures that multi-turn, multi-call trajectories are logically consistent by decomposing higher-order tasks into atomic subgoals, with LLM-based verification on compositional validity.
- Dual-Layer Validation: ToolACE (Liu et al., 2024) automatically enforces syntactic correctness (rule layer) and then uses a reasoning LLM to probe for hallucinated or inconsistent argument values (model layer), discarding samples at either stage.
- Empirical Distribution Alignment: RouteNator (Belavadi et al., 15 May 2025) targets close statistical alignment between synthetic and real-world query/parameter distributions, tuning its router algorithm by minimizing KL divergence and earth mover’s distance on features like query length and API frequency.
- Complexity Metrics and Diversity Balancing: Datasets report statistics on call depth, argument count, parameter type frequency, and logical-dependency depth (FunReason-MT: mean logical depth 3.4, 10 000 multi-turn trajectories (Xu et al., 28 Oct 2025); ToolACE: explicit “complexity” stratification).
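As referenced under Hierarchical Filtering above, the three-stage cascade can be sketched as follows; `execute_call` and `llm_judge` are placeholders for a sandboxed API runner and a judge-model client, so this captures the control flow rather than any released implementation.

```python
import json

def passes_filters(raw_record: str, execute_call, llm_judge) -> bool:
    """Return True only if a candidate record survives all three stages."""
    # Stage 1: format check -- the record must be valid JSON with required keys.
    try:
        record = json.loads(raw_record)
        query, calls = record["query"], record["answers"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False

    # Stage 2: execution check -- every call must run without raising an error.
    results = []
    for call in calls:
        try:
            results.append(execute_call(call["name"], call["arguments"]))
        except Exception:
            return False

    # Stage 3: semantic check -- a judge model confirms the calls answer the query.
    return llm_judge(query=query, calls=calls, results=results) == "pass"
```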
4. Scale, Domain Breadth, and Representativity
The scale and coverage of published synthetic datasets vary considerably:
| Dataset/Pipeline | Size (# examples) | APIs / Tools | Domains (examples) | Multi-turn | Irrelevance |
|---|---|---|---|---|---|
| Granite-FunctionCalling | 142,000 | ~3,700 | General (API-BLEND, Glaive) | Limited | No |
| APIGen/xLAM | 60,000 | 3,673 | 21 (balanced) | Single-turn | No |
| FunRL | 58,759 | ~2,350 sigs | General (xLAM) | No | No |
| ToolPRM | 5.7M (process) | Masked | General xLAM + Hammer | N/A | Yes |
| FunReason-FCDR | 60,000 | 3,673 | 21 | Reasoning | No |
| FunReason-MT | 10,000 | 120 | Simulation graph | Yes | Indirect |
| BUTTONInstruct | 8,000 | Synthesized | Compositional/Atomic | Yes | No |
| ToolACE (full) | 500,000+ dialogs | 26,507 | Hierarchical taxonomy | Extensive | Yes |
| RouteNator | 215,100 | Content APIs | Design/Creative | No | N/A |
| DICE-Bench | 1,607 | 124 | Multi-party, dialogue | Yes | N/A |
| mind_call | 50,000 | 7 | Wearable/mental-health | No | N/A |
| TinyAgent | 82,000 | 16 (MacOS) | AppleScript workflows | Yes | No |
| CallNavi | 729 | 579 | 10 (banking, HR, etc.) | Yes | No |
| Excel FTₛᵧₙ–QA | 6,440 | 100 | Excel formulae (QA/Table) | N/A | N/A |
Synthetic datasets span real APIs (APIGen, ToolACE, FunReason), exhaustively generated/sampled API signatures (ToolACE, CallNavi), and domain-specific workflows (mind_call, TinyAgent).
5. Impact, Evaluation, and Empirical Findings
Synthetic datasets are foundational to state-of-the-art model development and benchmarking. Empirical studies report:
- Benchmark Leadership: ToolACE-8B achieves 91.41% overall accuracy on BFCL-v1, surpassing GPT-4 preview and Claude-3.5; xLAM-7B matches or exceeds GPT-4-FC on the same benchmark (Liu et al., 2024, Liu et al., 2024).
- Model Robustness: Masked/irrelevance-augmented sets (Hammer, ToolPRM) increase generalization to unseen toolsets and enable robust “no call” decisions (Lin et al., 2024, Lin et al., 16 Oct 2025).
- Multi-Turn Competence: Datasets supporting multi-turn/planning dialog (FunReason-MT, BUTTON, DICE-Bench) yield 30–40 percentage point gains in multi-step agentic tasks compared to SFT on single-turn data (Xu et al., 28 Oct 2025, Chen et al., 2024, Jang et al., 28 Jun 2025).
- Quality–Diversity Tradeoffs: Empirical ablations show that dual-layer and compositionality validation can increase end-task accuracy by 4–6 points over rule-based filters alone (ToolACE).
- Domain Adaptation: Enterprise scenario datasets (Zeng et al., 2024), Excel (McKenna et al., 24 Mar 2025), and mind_call (mental health (Shafi et al., 11 Jan 2026)) confirm that synthetic methodologies generalize beyond standard information APIs.
6. Limitations, Open Problems, and Best Practices
Despite demonstrated utility, synthetic datasets have limitations:
- Absence of Real User-Domain Drift: Methods relying on LLM self-play or simulation may fail to represent open-world or adversarial user behaviors, even with sophisticated prompt variation and augmentation (explicitly acknowledged by FunReason-MT and BUTTON).
- Long-Chain and Multi-Modality Gaps: Few synthetic corpora exhibit realistic very-long dependency chains (>10 turns), multimodal context (image/audio), or dynamic tool sets (noted as a future direction by FunReason-MT, RouteNator).
- Human Validation Scarcity: Most datasets rely on LLM self-checking or majority voting for sample acceptance; only DICE-Bench and FunReason-MT report systematic human review components.
- Per-Task Granularity: Only a minority of pipelines report per-task or per-domain splits, limiting ablation and task-specific curriculum design [caveated in (Abdelaziz et al., 2024)].
Best practices emerging from the literature include:
- Multi-stage, hierarchical validation (format, execution, semantic)
- Function and parameter masking for description-based generalization
- Balanced inclusion of single-call, multi-call, and irrelevance/no-call data (an illustrative no-call record follows this list)
- Coverage stratification across API domains, parameter types, and reasoning types
- Human or LLM-based abductive dialogs for complex scenario simulation
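To illustrate the irrelevance/no-call practice noted above, here is a hypothetical training record in the same style as the parallel-multiple example in Section 2, where the gold target is an empty call list because no candidate tool matches; the query and tool descriptions are invented for illustration.

```python
# Hypothetical "no relevant call" record; an empty answers list (or an explicit
# refusal token, depending on the dataset's convention) is the gold target.
no_call_example = {
    "query": "What's a good recipe for banana bread?",
    "tools": [
        {"name": "finance.get_stock_price", "description": "Price of a stock symbol on a date."},
        {"name": "weather.get_forecast", "description": "Weather forecast for a city."},
    ],
    "answers": [],
}
```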
7. Representative Examples and Schema Illustration
Below are representative examples demonstrating the breadth of synthetic function-calling data:
Granite-FunctionCalling (Nested Call Example) (Abdelaziz et al., 2024):
```json
{
  "input": "What's the typical driving time between Las Vegas and the Grand Canyon?",
  "output": "<function_call> { \"name\": \"get_location\", \"arguments\": {\"point_on_map\": \"the Grand Canyon\"} } <function_call> { \"name\": \"get_estimated_duration\", \"arguments\": { \"source\": \"Las Vegas\", \"method_travel\": \"driving\", \"destination\": \"<function_response>get_location\" } }"
}
```
FunReason-MT (Multi-Turn Trajectory, Env–API Graph) (Xu et al., 28 Oct 2025):

| Statistic | Value |
|---|---|
| Trajectories | 10,000 |
| Avg. turns/trajectory | 5.3 (σ=1.8) |
| API coverage | 100% (120 tools) |
| Logical dep. depth | 3.4 (σ=1.1) |
ToolACE (Compositional Parallel Calls) (Liu et al., 2024):
- User: “Tell me upcoming Theatre, Dance, and Music events between 2021-04-01 and 2021-05-01.”
- Assistant: Three calls to `performanceArt.get_upcoming_events`, with the category adjusted in each call (sketched below).
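A hedged reconstruction of those three calls; the parameter names `category`, `start_date`, and `end_date` are assumptions for illustration, only the function name, categories, and date range come from the example above.

```python
# Assumed parameter names; the actual ToolACE schema may differ.
assistant_calls = [
    {"name": "performanceArt.get_upcoming_events",
     "arguments": {"category": c, "start_date": "2021-04-01", "end_date": "2021-05-01"}}
    for c in ["Theatre", "Dance", "Music"]
]
```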
mind_call (Normalized, Reasoned API Invocation) (Shafi et al., 11 Jan 2026):
```json
{
  "name": "get_sleep_data",
  "arguments": { "user_id": "user_12345", "numdays": 7 }
}
```
- Explicit reasoning: maps "recently" to `numdays=7`.
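This kind of temporal normalization can be sketched as a small lookup from relative time expressions to a concrete `numdays` value; the mapping below is illustrative and not the mind_call rule set.

```python
# Illustrative mapping from vague temporal phrases to a concrete window size.
RELATIVE_WINDOWS = {"recently": 7, "last week": 7, "this month": 30, "today": 1}

def normalize_numdays(query: str, default: int = 7) -> int:
    """Return the window implied by the first matching phrase, else a default."""
    lowered = query.lower()
    for phrase, days in RELATIVE_WINDOWS.items():
        if phrase in lowered:
            return days
    return default
```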
In summary, synthetic function-calling datasets constitute a foundational methodology for equipping LLMs with robust, high-accuracy tool-use and planning capacities. The field has evolved from simple instruction reformatting to large-scale, compositional, and multi-turn simulation with tightly integrated verification. Ongoing work continues to expand complexity, domain breadth, and the representational diversity required for agentic AI in open-world, real-world applications (Abdelaziz et al., 2024, Liu et al., 2024, Lin et al., 2024, Xu et al., 28 Oct 2025, Chen et al., 2024, Liu et al., 2024, Hao et al., 7 Aug 2025, Shafi et al., 11 Jan 2026).