Synthetic Function-Calling Dataset

Updated 18 January 2026
  • Synthetic function-calling datasets are artificial corpora that systematically map natural language queries into structured, machine-readable function call representations.
  • They employ varied methodologies including instruction re-purposing, LLM-driven generation, and multi-agent simulation to cover diverse API paradigms and ensure rigorous verification.
  • These datasets enhance model robustness and multi-turn dialog capability by enabling precise, scalable training through compositional validation and augmentation techniques.

A synthetic function-calling dataset is an artificial corpus specifically crafted to train or evaluate LLMs on the mapping between natural-language user inputs and precise, machine-readable function call representations, typically within agentic or tool-augmented AI paradigms. These datasets act as surrogates for scarce or privacy-sensitive real user–API logs, enabling systematic coverage, fine-grained complexity control, and rigorous verification—attributes often lacking in naturally occurring data. The following sections survey the main approaches, design axes, representative methodologies, and empirical impacts of recent synthetic function-calling datasets.

1. Methodologies for Synthetic Function-Calling Data Generation

Synthetic function-calling datasets employ diverse methodologies, ranging from simple instruction reformatting of legacy corpora to sophisticated, multi-agent simulation with explicit verification. Key approaches include:

  • Instruction Re-purposing: Granite-FunctionCalling exemplifies the simplest approach, repurposing existing dialog/semantic parsing datasets (e.g., MultiWOZ, SNIPS) by re-instructing each example for function-calling and unifying the output schema to a structured JSON format. No new samples are generated; rather, instances are mapped into seven granular tasks such as nested calling, chaining, parallel calls, and parameter extraction (Abdelaziz et al., 2024).
  • Automated LLM-Routed Synthesis: Pipelines such as APIGen (Liu et al., 2024) sample real executable APIs and use large LLMs to generate candidate queries, enforcing correctness via three-stage format, execution, and semantic checks. Parallel efforts such as FunRL (Hao et al., 7 Aug 2025) introduce iterative LLM-based evaluation and abstract syntax tree (AST) validation, discarding noisy or ill-formed samples.
  • Multi-Agent Simulation and Environment Modeling: Advanced frameworks such as FunReason-MT (Xu et al., 28 Oct 2025), BUTTON (Chen et al., 2024), DICE-Bench (Jang et al., 28 Jun 2025), and ToolACE (Liu et al., 2024) utilize simulated dialogues among virtual users, agents, and tool APIs—often controlled by a scenario graph or API-dependency graph—to instantiate complex, multi-turn, or multi-party scenarios. Explicit graph-based orchestration enables logical dependency, long-context chaining, and fine-grained difficulty control.
  • Domain- or Modality-Specific Generation: Datasets such as mind_call target specific application domains—in this case, mental health use cases grounded in wearable sensor data—by enumerating domain-relevant function templates and exhaustively generating queries across explicit, implicit, behavioral, symptom-based, or metaphorical linguistic categories (Shafi et al., 11 Jan 2026).
  • Augmentation for Model Robustness: Hammer (Lin et al., 2024) and similar efforts introduce algorithmic irrelevance and masking augmentation, generating synthetic “no relevant call” examples and randomizing function or parameter names at train time to increase out-of-vocabulary robustness.
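
To make the final point concrete, the following is a minimal sketch of name-masking and irrelevance augmentation in the spirit of Hammer (Lin et al., 2024); the sample format (query/tools/answers) and helper names are illustrative assumptions rather than the published implementation.

import copy
import random
import string

def random_name(length=8):
    # Replacement identifiers are random strings, so the model must rely on
    # function descriptions rather than memorized names.
    return "fn_" + "".join(random.choices(string.ascii_lowercase, k=length))

def mask_names(sample):
    """Randomize function names in tools and answers (parameter names can be masked analogously)."""
    sample = copy.deepcopy(sample)
    name_map = {}
    for tool in sample["tools"]:
        new_name = random_name()
        name_map[tool["name"]] = new_name
        tool["name"] = new_name
    for call in sample["answers"]:
        call["name"] = name_map.get(call["name"], call["name"])
    return sample

def make_irrelevant(sample, distractor_tools):
    """Swap in distractor tools so the correct behavior is to emit no call at all."""
    sample = copy.deepcopy(sample)
    sample["tools"] = random.sample(distractor_tools, k=min(3, len(distractor_tools)))
    sample["answers"] = []
    return sample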

The table below summarizes distinguishing features of selected pipelines:

| Dataset/Pipeline | Generation Mode | Verification / Filtering |
|---|---|---|
| Granite-FunctionCalling | Instruction re-purposing | None beyond schema unification |
| APIGen/xLAM | LLM generation | 3-stage (format, execution, semantic) |
| FunRL | LLM generation, AST eval | Multi-pass LLM+AST filtering |
| FunReason(-MT) | CoT/LLM, iterative graph | Self-refinement, majority voting |
| BUTTON | Multi-agent simulation | Heuristics for task compositionality |
| DICE-Bench | Multi-party simulation | Automated and human dialogue checks |
| ToolACE | LLM evolution + simulation | Dual-layer (rule+LLM) validation |
| Hammer | xLAM + augmentation | Masking; irrelevance synthesis |
| mind_call | Domain enumeration | Manual normalization; chain-of-thought |
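
Several of the pipelines above (FunReason-MT, BUTTON, DICE-Bench) orchestrate generation over an explicit API-dependency or scenario graph. Below is a minimal sketch of sampling a multi-turn call skeleton from such a graph; the graph contents and API names are hypothetical and not drawn from any cited framework.

import random

# Hypothetical API-dependency graph: an edge A -> B means a call to B
# needs output from A (e.g., get_location feeds get_estimated_duration).
API_DEPENDENCIES = {
    "get_location": [],
    "get_estimated_duration": ["get_location"],
    "book_ride": ["get_estimated_duration"],
    "get_weather": ["get_location"],
}

def dependency_chain(target, graph):
    """Return the prerequisite calls of `target` in executable order (DFS post-order)."""
    order, seen = [], set()
    def visit(api):
        if api in seen:
            return
        seen.add(api)
        for dep in graph[api]:
            visit(dep)
        order.append(api)
    visit(target)
    return order

def sample_scenario(graph):
    """Sample a target API and derive the multi-turn call skeleton it implies."""
    target = random.choice(list(graph))
    chain = dependency_chain(target, graph)
    # Each element becomes one turn: a simulated user supplies arguments, and the
    # outputs of earlier calls are threaded into later ones.
    return {"target": target, "turns": [{"call": api} for api in chain]}

print(sample_scenario(API_DEPENDENCIES))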

2. Schema, Task Coverage, and Dataset Structures

Synthetic datasets typically standardize both the function call schema and the prompting interface. Key characteristics include:

  • Function Definition Schemas: Most datasets use JSON schemas to express function signatures, including name, description, and arguments (typed fields with optional, required or nested structures). These are injected verbatim as “library” prompts in each example to teach model-to-API grounding (Abdelaziz et al., 2024, Liu et al., 2024, Zeng et al., 2024).
  • Canonical Output Formats: Output schemas require the model to produce objects such as:
    { "name": "function_name", "arguments": { "param1": value, ... } }
    or their equivalents in directed acyclic plan representations, especially for sequential and parallel call compositions (TinyAgent (Erdogan et al., 2024)).
  • Granular Task Types: Synthetic datasets often target orthogonal capabilities, including:
    • Nested function calling (output of one as argument to another)
    • Flat function chaining (ordered, non-nested invocations)
    • Parallel/independent calls
    • Function/tool selection (from long candidate lists)
    • Parameter extraction and slot-filling (from noisy natural input)
    • No-call/irrelevance handling (when no candidate fits; Hammer, ToolACE)
    • Multi-turn dialogue, with reasoning and logical dependencies (FunReason-MT, BUTTON)
    • Domain-specific mapping (mind_call: temporal normalization, category grounding)
  • Instruction Styles and Context: Prompt schemas encode both the function candidates and relevant instructions, with system prompts specifying “call the appropriate function in JSON given the following API library…”

Example synthetic (parallel-multiple) entry (Liu et al., 2024):

{
  "query": "Get stock prices for AAPL and GOOGL today; also translate 'Hello' to French.",
  "tools": [ ... ],
  "answers": [
    {"name": "finance.get_stock_price", "arguments": {"symbol": "AAPL", "date": "2024-06-01"}},
    {"name": "finance.get_stock_price", "arguments": {"symbol": "GOOGL", "date": "2024-06-01"}},
    {"name": "ml.translate", "arguments": {"text": "Hello", "to": "fr"}}
  ]
}
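
A minimal sketch of how an entry like the one above could be rendered into a single supervised prompt/target pair is shown below; the system-prompt wording, field names, and the simplified tool schema are illustrative assumptions, not the exact format of any cited pipeline.

import json

def render_example(sample):
    """Turn a synthetic entry into a (prompt, target) text pair for supervised fine-tuning."""
    system = ("Call the appropriate function(s) in JSON given the following API library:\n"
              + json.dumps(sample["tools"], indent=2))
    prompt = f"{system}\n\nUser: {sample['query']}\nAssistant:"
    # The target is the canonical structured output the model must reproduce.
    target = json.dumps(sample["answers"])
    return prompt, target

# Illustrative entry with a single tool, in the JSON-schema style described above.
example = {
    "query": "Get the stock price for AAPL today.",
    "tools": [{
        "name": "finance.get_stock_price",
        "description": "Return the closing price of a stock symbol on a given date.",
        "parameters": {
            "type": "object",
            "properties": {"symbol": {"type": "string"},
                           "date": {"type": "string", "description": "ISO date"}},
            "required": ["symbol"],
        },
    }],
    "answers": [{"name": "finance.get_stock_price",
                 "arguments": {"symbol": "AAPL", "date": "2024-06-01"}}],
}

prompt, target = render_example(example)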

3. Verification, Quality Assurance, and Diversity Control

Synthetic function-calling corpora leverage multi-layered verification to maximize data correctness, coverage, and variance:

  • Hierarchical Filtering: APIGen (Liu et al., 2024) and FunRL (Hao et al., 7 Aug 2025) use format checking (JSON validity), live execution of candidate function calls (with error trapping), and semantic validation (via LLM “judge” prompts). FunRL further analyzes each sample’s AST, discarding any with malformed or mismatched trees. A minimal sketch of this pattern appears after this list.
  • Dynamic Augmentation and Masking: Automatic function and parameter name masking encourages reliance on function descriptions over memorization, yielding 4–5× data diversity (ToolPRM (Lin et al., 16 Oct 2025), Hammer (Lin et al., 2024)).
  • Compositionality Enforcement: BUTTON (Chen et al., 2024) heuristically ensures that multi-turn, multi-call trajectories are logically consistent by decomposing higher-order tasks into atomic subgoals, with LLM-based verification on compositional validity.
  • Dual-Layer Validation: ToolACE (Liu et al., 2024) automatically enforces syntactic correctness (rule layer) and then uses a reasoning LLM to probe for hallucinated or inconsistent argument values (model layer), discarding samples at either stage.
  • Empirical Distribution Alignment: RouteNator (Belavadi et al., 15 May 2025) targets close statistical alignment between synthetic and real-world query/parameter distributions, tuning its router algorithm by minimizing KL divergence and earth mover’s distance on features like query length and API frequency.
  • Complexity Metrics and Diversity Balancing: Datasets report statistics on call depth, argument count, parameter type frequency, and logical-dependency depth (FunReason-MT: mean logical depth 3.4, 10 000 multi-turn trajectories (Xu et al., 28 Oct 2025); ToolACE: explicit “complexity” stratification).
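
The staged pattern sketched below follows the hierarchical filtering idea from APIGen/FunRL; the executor registry, judge interface, and return conventions are assumptions made for illustration.

import json

def format_check(raw):
    """Stage 1: the candidate must be valid JSON with name/arguments fields."""
    try:
        calls = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(calls, dict):
        calls = [calls]
    if not all(isinstance(c, dict) and {"name", "arguments"} <= c.keys() for c in calls):
        return None
    return calls

def execution_check(calls, registry):
    """Stage 2: every call must execute against a live implementation without raising."""
    try:
        return [registry[c["name"]](**c["arguments"]) for c in calls]
    except Exception:  # unknown function, bad arguments, runtime error, ...
        return None

def semantic_check(query, calls, results, judge):
    """Stage 3: an LLM judge decides whether the calls actually answer the query."""
    verdict = judge(f"Query: {query}\nCalls: {calls}\nResults: {results}\nAnswer yes or no.")
    return verdict.strip().lower().startswith("yes")

def keep_sample(query, raw, registry, judge):
    calls = format_check(raw)
    if calls is None:
        return False
    results = execution_check(calls, registry)
    if results is None:
        return False
    return semantic_check(query, calls, results, judge)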

4. Scale, Domain Breadth, and Representativity

The scale and coverage of published synthetic datasets vary considerably:

| Dataset/Pipeline | Size (# examples) | APIs / Tools | Domains (examples) | Multi-turn | Irrelevance |
|---|---|---|---|---|---|
| Granite-FunctionCalling | 142,000 | ~3,700 | General (API-BLEND, Glaive) | Limited | No |
| APIGen/xLAM | 60,000 | 3,673 | 21 (balanced) | Single-turn | No |
| FunRL | 58,759 | ~2,350 sigs | General (xLAM) | No | No |
| ToolPRM | 5.7M (process) | Masked | General (xLAM + Hammer) | N/A | Yes |
| FunReason-FCDR | 60,000 | 3,673 | 21 | Reasoning | No |
| FunReason-MT | 10,000 | 120 | Simulation graph | Yes | Indirect |
| BUTTONInstruct | 8,000 | Synthesized | Compositional/Atomic | Yes | No |
| ToolACE (full) | 500,000+ dialogs | 26,507 | Hierarchical taxonomy | Extensive | Yes |
| RouteNator | 215,100 | Content APIs | Design/Creative | No | N/A |
| DICE-Bench | 1,607 | 124 | Multi-party, dialogue | Yes | N/A |
| mind_call | 50,000 | 7 | Wearable/mental-health | No | N/A |
| TinyAgent | 82,000 | 16 (MacOS) | AppleScript workflows | Yes | No |
| CallNavi | 729 | 579 | 10 (banking, HR, etc.) | Yes | No |
| Excel FTₛᵧₙ–QA | 6,440 | 100 | Excel formulae (QA/Table) | N/A | N/A |

Synthetic datasets span real APIs (APIGen, ToolACE, FunReason), exhaustively generated/sampled API signatures (ToolACE, CallNavi), and domain-specific workflows (mind_call, TinyAgent).

5. Impact, Evaluation, and Empirical Findings

Synthetic datasets are foundational to state-of-the-art model development and benchmarking. Empirical studies report:

  • Benchmark Leadership: ToolACE-8B achieved 91.41% overall accuracy on BFCL-v1, surpassing GPT-4 preview and Claude-3.5; xLAM-7B matches or exceeds GPT-4-FC on the same benchmark (Liu et al., 2024, Liu et al., 2024).
  • Model Robustness: Masked/irrelevance-augmented sets (Hammer, ToolPRM) increase generalization to unseen toolsets and enable robust “no call” decisions (Lin et al., 2024, Lin et al., 16 Oct 2025).
  • Multi-Turn Competence: Datasets supporting multi-turn/planning dialog (FunReason-MT, BUTTON, DICE-Bench) yield 30–40 percentage point gains in multi-step agentic tasks compared to SFT on single-turn data (Xu et al., 28 Oct 2025, Chen et al., 2024, Jang et al., 28 Jun 2025).
  • Quality–Diversity Tradeoffs: Empirical ablations show that dual-layer and compositionality validation can increase end-task accuracy by 4–6 points over rule-based filters alone (ToolACE).
  • Domain Adaptation: Enterprise scenario datasets (Zeng et al., 2024), Excel (McKenna et al., 24 Mar 2025), and mind_call for mental health (Shafi et al., 11 Jan 2026) confirm that synthetic methodologies generalize beyond standard information APIs.

6. Limitations, Open Problems, and Best Practices

Despite demonstrated utility, synthetic datasets have limitations:

  • Absence of Real User-Domain Drift: Methods relying on LLM self-play or simulation may fail to represent open-world or adversarial user behaviors, even with sophisticated prompt variation and augmentation (explicitly acknowledged by FunReason-MT and BUTTON).
  • Long-Chain and Multi-Modality Gaps: Few synthetic corpora exhibit realistic very-long dependency chains (>10 turns), multimodal context (image/audio), or dynamic tool sets (noted as a future direction by FunReason-MT, RouteNator).
  • Human Validation Scarcity: Most datasets rely on LLM self-checking or majority voting for sample acceptance; only DICE-Bench and FunReason-MT report systematic human review components.
  • Per-Task Granularity: Only a minority of pipelines report per-task or per-domain splits, limiting ablation and task-specific curriculum design (caveated in Abdelaziz et al., 2024).

Best practices emerging from the literature include:

  • Multi-stage, hierarchical validation (format, execution, semantic)
  • Function and parameter masking for description-based generalization
  • Balanced inclusion of single-call, multi-call, and irrelevance/no-call data
  • Coverage stratification across API domains, parameter types, and reasoning types
  • Human or LLM-based abductive dialogs for complex scenario simulation
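
As one way to operationalize the balancing recommendation above, the sketch below stratifies a candidate pool by call pattern before sub-sampling; the target ratios and field names are arbitrary placeholders, not values reported by any cited work.

import random
from collections import defaultdict

# Hypothetical target mix over call patterns (placeholder ratios).
TARGET_MIX = {"single_call": 0.4, "multi_call": 0.4, "no_call": 0.2}

def call_pattern(sample):
    n = len(sample["answers"])
    return "no_call" if n == 0 else "single_call" if n == 1 else "multi_call"

def balanced_subsample(pool, total):
    """Draw `total` samples matching TARGET_MIX, stratified by call pattern."""
    buckets = defaultdict(list)
    for s in pool:
        buckets[call_pattern(s)].append(s)
    out = []
    for pattern, ratio in TARGET_MIX.items():
        k = min(int(total * ratio), len(buckets[pattern]))
        out.extend(random.sample(buckets[pattern], k))
    random.shuffle(out)
    return out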

7. Representative Examples and Schema Illustration

Below are representative examples demonstrating the breadth of synthetic function-calling data:

Granite-FunctionCalling (Nested Call Example) (Abdelaziz et al., 2024):

{
  "input": "What's the typical driving time between Las Vegas and the Grand Canyon?",
  "output": "<function_call> { \"name\": \"get_location\", \"arguments\": {\"point_on_map\": \"the Grand Canyon\"} } <function_call> { \"name\": \"get_estimated_duration\", \"arguments\": { \"source\": \"Las Vegas\", \"method_travel\": \"driving\", \"destination\": \"<function_response>get_location\" } }"
}

FunReason-MT (Multi-Turn Trajectory, Env–API Graph) (Xu et al., 28 Oct 2025):

| Statistic | Value |
|---|---|
| Trajectories | 10,000 |
| Avg. turns/trajectory | 5.3 (σ=1.8) |
| API coverage | 100% (120 tools) |
| Logical dep. depth | 3.4 (σ=1.1) |

ToolACE (Compositional Parallel Calls) (Liu et al., 2024):

  • User: “Tell me upcoming Theatre, Dance, and Music events between 2021-04-01 and 2021-05-01.”
  • Assistant: Three calls to performanceArt.get_upcoming_events with category adjusted.

mind_call (Normalized, Reasoned API Invocation) (Shafi et al., 11 Jan 2026):

{
  "name": "get_sleep_data",
  "arguments": { "user_id": "user_12345", "numdays": 7 }
}

  • Explicit reasoning: maps "recently" to numdays=7.
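
A minimal sketch of the temporal normalization this reasoning implies is given below; the phrase-to-window mapping and default value are hypothetical illustrations rather than the published mind_call rules.

# Hypothetical mapping of relative time phrases to a look-back window in days.
RELATIVE_WINDOWS = {
    "recently": 7,
    "this past week": 7,
    "lately": 14,
    "this month": 30,
}

def normalize_numdays(query, default=7):
    """Resolve a relative time expression in the query to the numdays argument."""
    text = query.lower()
    for phrase, days in RELATIVE_WINDOWS.items():
        if phrase in text:
            return days
    return default

call = {
    "name": "get_sleep_data",
    "arguments": {"user_id": "user_12345",
                  "numdays": normalize_numdays("How have I been sleeping recently?")},
}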

In summary, synthetic function-calling datasets constitute a foundational methodology for equipping LLMs with robust, high-accuracy tool-use and planning capacities. The field has evolved from simple instruction reformatting to large-scale, compositional, and multi-turn simulation with tightly integrated verification. Ongoing work continues to expand complexity, domain breadth, and the representational diversity required for agentic AI in open-world, real-world applications (Abdelaziz et al., 2024, Liu et al., 2024, Lin et al., 2024, Xu et al., 28 Oct 2025, Chen et al., 2024, Liu et al., 2024, Hao et al., 7 Aug 2025, Shafi et al., 11 Jan 2026).
