
Function Calling Methods in LLMs

Updated 18 January 2026
  • Function calling methods are techniques through which LLMs emit structured, JSON-formatted calls to interact with external APIs and tools.
  • They optimize performance by balancing execution time, power consumption, and accuracy through dynamic tool selection and dependency modeling.
  • Implementation leverages embedding-based retrieval, parallel execution, and reinforcement learning to enhance scalability and efficiency in real-world applications.

Function calling methods constitute a central paradigm in enabling LLMs to interact with external tools, APIs, and data sources. By emitting structured requests—typically in JSON or a similar formalism—LLMs can execute workflows far beyond pure text generation, ranging from search and code execution to operating physical devices or querying specialized databases. The rapid emergence and diversification of function calling techniques reflect challenges in accuracy, efficiency, safety, sustainability, and protocol compatibility at scale.

1. Formal Problem Formulation and System Objectives

Function calling with LLMs involves selecting a subset of external tools or APIs to fulfill a user query, instantiating structured calls with argument extraction, reasoning about dependencies, and integrating results—all while adhering to efficiency constraints. The canonical formulation defines:

  • $Q$: user query (text prompt).
  • $T = \{\tau_1, \ldots, \tau_N\}$: set of available tools.
  • $S \subseteq T$: subset of tools injected into the function-calling prompt.
  • $t(Q,S)$, $p(Q,S)$, $s(Q,S)$: execution time, power consumption, and task completion success rate, conditional on query and tool subset.

The optimization target is a joint minimization:

$$\min_{S \subseteq T} J(S) = \alpha\, t(Q,S) + \beta\, p(Q,S) - \gamma\, s(Q,S), \quad \text{subject to} \quad |S| \le K_{\max},\ s(Q,S) \in \{0,1\}$$

The parameters $\alpha, \beta, \gamma$ reflect application priorities (e.g., latency, efficiency, correctness) (Paramanayakam et al., 2024).

This framework generalizes across retrieval-augmented prompting, greedy or RL-based selection, and multi-stage planning/execution.
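
As a concrete illustration, the objective above can be evaluated by brute force over small tool sets. This is a minimal sketch: the cost functions `t`, `p`, `s` are hypothetical stand-ins for measured quantities, and real systems replace exhaustive enumeration with the retrieval heuristics discussed next.

```python
from itertools import combinations

def score(S, t, p, s, alpha=1.0, beta=1.0, gamma=10.0):
    # J(S) = alpha*t(Q,S) + beta*p(Q,S) - gamma*s(Q,S) for one candidate subset
    return alpha * t(S) + beta * p(S) - gamma * s(S)

def select_tools(tools, t, p, s, k_max=3, **weights):
    # Enumerate all non-empty subsets with |S| <= K_max and pick the minimiser.
    # Feasible only for small tool sets; at scale, retrieval replaces this.
    candidates = [set(c) for r in range(1, k_max + 1)
                  for c in combinations(tools, r)]
    return min(candidates, key=lambda S: score(S, t, p, s, **weights))
```

With $\gamma$ large relative to $\alpha, \beta$, correctness dominates and the smallest subset that succeeds wins out over larger, costlier ones.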

2. Tool Selection and Retrieval Strategies

A bottleneck in scaling function calling lies in managing a large $T$ without overwhelming the LLM's context window or confusing the model. Dynamic, query- and context-conditioned retrieval methods have emerged:

  • Embedding-based Nearest Neighbor: Tools are embedded (e.g., via MPNet, DeBERTa, or sentence encoders), and for a given query (or query + plan history), the closest-matching tools are selected (k-NN search using FAISS or similar) (Paramanayakam et al., 2024, Paramanayakam et al., 29 Apr 2025, Erdogan et al., 2024).
  • Dynamic Dependency Modeling: Modules like DTDR condition tool retrieval on both the initial query and evolving plan state. DTDR-Clustering leverages Markov graphs over co-occurrences in demonstration trajectories, while DTDR-Linear learns embeddings over query-history pairs with a multinomial classifier (Patel et al., 18 Dec 2025). This approach improves tool retrieval F1 and function selection accuracy by up to 104% over static methods.
  • Prompt Integration Variants: Four main styles—raw demonstrations, hard masking (only selected tools presented), soft masking (ranking or highlighting probable tools), and weighted masking (showing probability scores)—trade off prompt length, context window usage, and small-vs-large model suitability (Patel et al., 18 Dec 2025).

Rigorous pre-filtering reduces context size, improves agentic accuracy, and prevents prompt bloat, particularly on edge devices.
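
The embedding-based nearest-neighbor step can be sketched as brute-force cosine similarity in plain Python. In practice the embeddings would come from an encoder such as MPNet and a library like FAISS would handle the search at scale; the toy interface below is an assumption for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_tools(query_vec, tools, k=3):
    # tools: list of (name, description_embedding) pairs.
    # Brute-force k-NN; FAISS or similar replaces this for large registries.
    ranked = sorted(tools, key=lambda t: cosine(query_vec, t[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

Only the selected names (and their descriptions) are then injected into the function-calling prompt, keeping context size bounded regardless of registry size.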

3. Structured Planning, Parallelism, and Asynchronicity

Sophisticated planners decompose tasks into multi-step, often parallelizable, function call graphs:

  • Static DAG Construction: Planners (via prompt or learned policy) output a sequence/DAG of function call steps with explicit data dependencies. Schedulers then execute "ready" nodes in parallel whenever possible (Kim et al., 2023, Singh et al., 2024).
  • Fused Function Calls: The LLM-Tool Compiler fuses similar-type operations on groups of tools, reducing the total number of prompt-invoked functions (from $|S|$ to $k \ll |S|$), analogously to hardware multiply-add fusion (Singh et al., 2024). This achieves up to 4–5× higher parallelization and 12%–40% latency/token cost reductions over naive approaches.
  • Asynchronous Execution: AsyncLM introduces an in-context protocol (CML) supporting non-blocking calls with interrupt injection on result arrival (Gim et al., 2024). This enables overlap of LLM generation and external execution, achieving 1.6×–5.4× total latency improvements. The protocol supports not only executor but also human and LLM-LLM interrupts, and requires fine-tuning to handle "interrupt" tokens and dynamic scheduling.

By modeling function orchestration as a program compilation and scheduling problem, these methods deliver scalable, low-latency execution.
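
The "execute ready nodes in parallel" scheduling loop can be sketched with a thread pool. The node ids, dependency-set representation, and `call` hook below are illustrative assumptions, not an interface from the cited systems.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_dag(deps, call):
    # deps: {node: set(prerequisite nodes)}; call: node -> result
    # Launches every node whose prerequisites have completed ("ready" nodes)
    # in parallel, then blocks until at least one in-flight call finishes.
    results, pending = {}, {}
    remaining = dict(deps)
    with ThreadPoolExecutor() as pool:
        while remaining or pending:
            ready = [n for n, d in remaining.items() if d <= results.keys()]
            if not ready and not pending:
                raise ValueError("dependency cycle detected")
            for n in ready:
                pending[pool.submit(call, n)] = n
                del remaining[n]
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results[pending.pop(fut)] = fut.result()
    return results
```

Independent calls (here `a` and `b`) overlap in time, while `c` waits only for its actual data dependencies rather than for global sequential order.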

4. Learning Paradigms: Supervised, Adversarial, and Reinforcement Learning

Beyond template-based prompting, supervised fine-tuning, adversarial data generation, and RL have all been employed:

  • Supervised Fine-Tuning (SFT): Models are trained to emit JSON-formatted calls—in some cases, with masking over function and parameter names to enforce reliance on semantic descriptions rather than spurious string matching (Lin et al., 2024). Balanced datasets are constructed with positive (function call) and negative (irrelevance) cases, often augmented with complexity and diversity ablations (Liu et al., 2024).
  • Adversarial Curriculum (ADC): Line-level code execution feedback and generator-discriminator loops increase robustness in difficult parameter-matching and hard function-calling settings, reaching top execution accuracy on BFCL-v2 (87.5%) (Zhang et al., 2024).
  • Reinforcement Learning (RL): Strategic entropy-based exploration over chain-of-thought segments (FunRL) stabilizes group-based relative policy optimization (GRPO), directly optimizing for AST-parsable, correct calls. RL-stage models exceed SFT-only baselines by 6pp overall accuracy on complex benchmarks (Hao et al., 7 Aug 2025).
  • Process-Reward Models: Fine-grained step-level rewards via process supervision (ToolPRM) for each sub-decision in function call generation (API name, param, value) enable customized beam search ("explore more, retain less"), capitalizing on the irrecoverability of errors in structured outputs (Lin et al., 16 Oct 2025).

This diversity of training regimes enables function-calling models to generalize beyond narrow domains and improves robustness, especially under multi-step, multi-function, or adversarial settings.
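
As one example of the data-side techniques above, name masking for SFT (in the spirit of Lin et al., 2024) can be sketched as follows. The training-record format and the `fn_{i}` placeholder scheme are hypothetical illustrations, not the paper's actual format.

```python
import json

def mask_tool_names(example, mask_fmt="fn_{i}"):
    # Replace function names in one SFT example with opaque placeholders,
    # forcing the model to pick tools from their natural-language
    # descriptions rather than from memorized name strings.
    mapping = {t["name"]: mask_fmt.format(i=i)
               for i, t in enumerate(example["tools"])}
    masked = json.loads(json.dumps(example))  # deep copy; input untouched
    for t in masked["tools"]:
        t["name"] = mapping[t["name"]]
    masked["call"]["name"] = mapping[masked["call"]["name"]]
    return masked, mapping
```

The returned mapping lets an evaluation harness translate the model's masked predictions back to real function names for execution.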

5. Edge Efficiency, Power, and Sustainability

Deploying function calling methods on-device demands strict optimization of memory, latency, and energy:

  • Dynamic Tool Subset Selection: "Less is More" (LiS) and similar schemes select a compact, high-relevance tool subset per query (3–5 tools vs. 46–51 total), leveraging embedding similarity and clustering, yielding 70% reductions in execution time and 40% power savings without fine-tuning (Paramanayakam et al., 2024).
  • Quantization: Systematic use of 4- or 8-bit quantized LLMs and embedding models alongside fast, approximate retrieval drastically reduces memory and computation footprint (e.g., 4×–5× smaller, >30% lower latency with no accuracy loss) (Erdogan et al., 2024, Paramanayakam et al., 29 Apr 2025).
  • Carbon-Aware Policies: Frameworks such as CarbonCall monitor real-time carbon intensity forecasts, dynamically adjust hardware power limits, and switch between LLM quantization levels to maximize tokens-per-second throughput under sustainability constraints (up to 52% lower emissions) (Paramanayakam et al., 29 Apr 2025).

Careful tool prompt curation, quantized inference, and sustainability-aware scheduling are now foundational for edge function calling.
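
A toy sketch of a carbon-aware policy in the spirit of CarbonCall: pick a quantization level from a real-time carbon-intensity reading, trading accuracy headroom for energy when the grid is dirty. The thresholds and level names are illustrative assumptions, not values from the paper.

```python
def pick_quantization(carbon_intensity,
                      thresholds=((200, "fp16"), (400, "int8"))):
    # carbon_intensity: current grid intensity in gCO2/kWh.
    # thresholds: ascending (upper_limit, quantization_level) pairs;
    # anything above the last limit falls through to the most aggressive level.
    for limit, level in thresholds:
        if carbon_intensity < limit:
            return level
    return "int4"
```

A real deployment would also adjust hardware power caps and re-evaluate on each forecast update, as the framework described above does.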

6. Protocols, Tool Abstractions, and Developer Ergonomics

Complex production systems must interface with heterogeneous protocols and APIs:

  • Unified Tool Registries: Protocol-agnostic abstractions (e.g., ToolRegistry) encapsulate local Python callables, OpenAPI, MCP, and LangChain tools into a uniform schema: name, description, parameters (JSON-Schema), and implementation. Registry-based dispatch, automated schema generation, and dual-mode (thread/process) concurrency yield 60–80% code reduction and 3.1× throughput gains (Ding et al., 5 Aug 2025).
  • Schema Generation and Validation: Automated extraction of schema from function signatures and type annotations, combined with run-time JSON validation, normalizes tool registration, improves maintainability, and eliminates boilerplate.
  • Practical Integration Patterns: Function libraries for safety-critical domains (e.g., nuclear plant data retrieval) can be bootstrapped via NL-to-SQL tools, hardened by SMEs, and wrapped as callable stubs. This delivers higher accuracy and maintainability versus end-to-end NL-to-SQL generation, especially when paired with retrieval-augmented function selection (Costa et al., 10 Jun 2025).

Compatibility with evolving standards (OpenAI, MCP, Gemini, etc.) and one-line integration of new adapters are now essential for sustainable, robust function-calling ecosystems.
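
Automated schema generation from signatures and type annotations, as performed by registry-style systems, can be sketched with the standard `inspect` module. The type mapping and output shape below are a simplified assumption (no defaults, unions, or nested types), not ToolRegistry's actual API.

```python
import inspect
from typing import get_type_hints

# Minimal Python-type -> JSON-Schema-type mapping; unknowns fall back to string.
PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def tool_schema(fn):
    # Derive an OpenAI-style tool schema (name, description, JSON-Schema
    # parameters) from a plain Python callable's signature and docstring.
    hints = get_type_hints(fn)
    params = {
        name: {"type": PY_TO_JSON.get(hints.get(name), "string")}
        for name in inspect.signature(fn).parameters
    }
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {"type": "object",
                       "properties": params,
                       "required": list(params)},
    }
```

Registering a tool then reduces to decorating or passing an ordinary function, with the schema kept in sync with the code automatically.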

7. Benchmarking, Evaluation, and Comparative Metrics

Benchmarking of function calling methods employs a rich suite of metrics and testbeds:

A summary excerpt from (Paramanayakam et al., 2024):

| Method      | Success Rate | Exec. Time ↓ | Power ↓ |
|-------------|--------------|--------------|---------|
| Default     | 63%          | 1.00         | 1.00    |
| Gorilla     | 68%          | 0.62         | 0.75    |
| LiS (k = 3) | 72%          | 0.20         | 0.55    |

Such comparative tabulation highlights the concrete gains of recent retrieval and selection schemes.


Function calling methods in LLMs have transitioned from template-based, synchronous, large-context prompting to highly efficient, dynamically retrieved, parallelized, and protocol-agnostic architectures. The field integrates advanced planning and RL, process-aware reward models, edge resource optimization, and unified tool abstraction. Methodologies now address the entire tool-chain—from tool selection to structured call emission, execution orchestration, real-world sustainability, and developer productivity—setting the stage for robust, scalable, and socially responsible LLM-based agents (Paramanayakam et al., 2024, Patel et al., 18 Dec 2025, Paramanayakam et al., 29 Apr 2025, Ding et al., 5 Aug 2025, Gim et al., 2024, Lin et al., 2024, Lin et al., 16 Oct 2025, Zhang et al., 2024, Singh et al., 2024).
