UniToolCall: Unified Tool-Calling Framework

Updated 2 July 2026

UniToolCall is a unified framework that consolidates methods, data protocols, and evaluation techniques for function-calling in large language models.
It employs a modular architecture with adapter, abstraction, registry, and execution layers to standardize tool integration across diverse platforms.
The framework optimizes real-time decoding and decision evaluation through auto-tuning, generative tool retrieval, and robust schema unification.

Unified ToolCall (UniToolCall) is a comprehensive paradigm and suite of frameworks unifying the methods, data, and evaluation protocols for function-calling in LLM agents. Originally conceived to address the fragmentation across function-calling protocols, interaction formats, benchmarks, and deployment architectures, UniToolCall now encompasses both practical tool integration architectures and standardized data-driven methodologies for tool-use modeling, real-time decoding, and decision evaluation. Its scope includes registry-based tool integration (Ding et al., 5 Aug 2025), unification of tool-use data and structural representations (Liang et al., 13 Apr 2026), efficient real-time execution (Shi et al., 4 Feb 2026), generative tool retrieval (Wang et al., 2024), and fine-grained decision benchmarking (Ross et al., 26 Apr 2025).

1. System Architecture and Protocol-Agnostic Tool Integration

The core UniToolCall framework is structured around four modular layers:

Adapter Layer: Implements protocol-specific adapters for Python functions, OpenAPI, Model Context Protocol (MCP), and LangChain, normalizing tool descriptions and invocations into a unified intermediate ("ToolCall") representation.
Core Abstraction Layer: Encapsulates each tool as an object containing its name, description, JSON-schema parameter set, Python callable, async/sync flag, and a Pydantic-validated parameter model. This abstraction supports strong schema guarantees and seamless integration.
Registry Layer: Provides a namespace-indexed tool store with support for merging, spinoff, namespace reduction, and conflict resolution. Tool registration is supported for local functions, HTTP endpoints, MCP services, and agent plugins.
Execution Engine: Offers dual-mode concurrency, using thread pools for I/O-bound calls and process pools (with dill serialization) for CPU-bound or isolation-critical operations. An auto-tuner selects the mode for each invocation based on recent execution times.

The key architectural principle is protocol-agnosticism: once a tool is registered, the rest of the stack remains oblivious to whether the tool is a local Python callable, a web service, a streaming SSE source, or a complex agent. API-compatibility layers translate between OpenAI-style function-call fields and the internal ToolCall, ensuring compatibility with mainstream model APIs. This design enables full codebase reuse and consistent workflows across disparate tool sources (Ding et al., 5 Aug 2025).
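The registry idea can be illustrated with a minimal sketch. The `Tool` and `ToolRegistry` classes below are hypothetical stand-ins written for this example, not the framework's actual API; they only show how a namespace-indexed store can hide a tool's origin from the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Tool:
    """Hypothetical unified tool record (illustrative, not the real class)."""
    name: str
    description: str
    func: Callable[..., Any]


class ToolRegistry:
    """Namespace-indexed tool store; callers never see where a tool came from."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool, namespace: str = "") -> str:
        # The key is "<namespace>.<name>" when a namespace is given.
        key = f"{namespace}.{tool.name}" if namespace else tool.name
        if key in self._tools:
            raise KeyError(f"tool {key!r} already registered")
        self._tools[key] = tool
        return key

    def merge(self, other: "ToolRegistry", namespace: str) -> None:
        # Re-register every tool of `other` under the given namespace.
        for tool in other._tools.values():
            self.register(tool, namespace=namespace)

    def execute(self, name: str, arguments: Dict[str, Any]) -> Any:
        # Dispatch by registered key; the caller passes keyword arguments.
        return self._tools[name].func(**arguments)


def add(a: int, b: int) -> int:
    return a + b


registry = ToolRegistry()
registry.register(Tool("add", "Add two integers.", add), namespace="math")
result = registry.execute("math.add", {"a": 2, "b": 3})
```

An adapter for an HTTP endpoint or an MCP service would, under this assumption, wrap the remote call in a Python callable and register it the same way, so `execute` stays unchanged.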

2. Automated Schema and Representation Unification

UniToolCall eliminates manual specification of tool schemas. For local functions, it introspects function signatures and type annotations, builds a dynamic Pydantic model, and exports a standards-compliant JSON schema. The set of required parameters consists of those parameters that have no default value:

$\mathtt{required} = \{x_i \mid x_i.\text{default} = \text{Undefined}\}$

where $x_i$ ranges over the function's parameters (Ding et al., 5 Aug 2025). The system supports Union and Optional types and flattens nested schema constructs to maintain OpenAI and JSON-schema compatibility. Generated schemas are cached, so repeated calls incur negligible overhead.
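A minimal sketch of the required-parameter rule, using only the standard library's `inspect` module in place of Pydantic. The `build_schema` helper and the type mapping are assumptions made for illustration; Union, Optional, and nested types are deliberately left out.

```python
import inspect
from typing import Any, Callable, Dict, List

# Illustrative mapping from Python annotations to JSON-schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def build_schema(func: Callable[..., Any]) -> Dict[str, Any]:
    """Derive a JSON-schema-like parameter description from a signature."""
    properties: Dict[str, Any] = {}
    required: List[str] = []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown annotations fall back to "string" (a simplification).
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        # A parameter is required exactly when it has no default value.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def search(query: str, limit: int = 10) -> list:
    """Search a document index."""
    return []


schema = build_schema(search)
```

Here `query` has no default and ends up in `required`, while `limit` does not, which mirrors the set definition above.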

At the representation level, dataset-centric variants of UniToolCall (Liang et al., 13 Apr 2026) unify tool-use trajectories across public and synthetic corpora through the Query–Action–Observation–Answer (QAOA) schema. Each function-call episode is represented as a $(Q, A, O, R)$ tuple (user query, tool-call JSON, tool invocation result, and model answer), with extensions for multi-step and multi-turn structures:

$\tau = (q, a_1, o_1, r_1, \ldots, a_K, o_K, r_K)$

$\{(Q_t, (A_{t,k}, O_{t,k})_{k=1}^{K_t}, R_t)\}_{t=1}^{T}$

This formalism standardizes representation, enabling consistent training and evaluation across a heterogeneous tool ecosystem (Liang et al., 13 Apr 2026).
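One possible in-memory layout for the multi-turn form is sketched below. The class and field names are assumptions chosen for this example; the sketch follows the second formula, with one answer $R_t$ per turn and $K_t$ action-observation pairs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    """One (A, O) pair: a tool call and the observation it produced."""
    action: Dict[str, Any]
    observation: Any


@dataclass
class Turn:
    """One turn t: query Q_t, K_t steps, and the final answer R_t."""
    query: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


@dataclass
class Trajectory:
    """A conversation of T turns."""
    turns: List[Turn] = field(default_factory=list)

    @property
    def is_multi_turn(self) -> bool:
        return len(self.turns) > 1

    @property
    def is_multi_hop(self) -> bool:
        # True when at least one turn contains more than one step.
        return any(len(turn.steps) > 1 for turn in self.turns)


trajectory = Trajectory(turns=[
    Turn(
        query="What is the weather in Paris in Fahrenheit?",
        steps=[
            Step({"name": "get_weather", "arguments": {"city": "Paris"}}, {"celsius": 20}),
            Step({"name": "c_to_f", "arguments": {"celsius": 20}}, {"fahrenheit": 68}),
        ],
        answer="It is 68 degrees Fahrenheit in Paris.",
    )
])
```

The tool names and values in the example trajectory are invented purely to populate the structure.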

3. High-Performance Execution and Real-Time Decoding

UniToolCall integrates multiple optimizations for throughput and real-time performance:

Concurrent Execution: The executor maintains both thread and process pools. Thread pools achieve up to a 2.4× speed-up for native functions; process pools yield gains of up to 3.1× for CPU-heavy or serialization-dependent tasks (benchmarked at a concurrency of 100).
Auto-Tuning: For each call, the engine switches to process mode when the ratio of mean thread-mode execution time to mean process-mode execution time exceeds a threshold:

$\frac{\overline{T_{\text{exec}}^\text{thread}}}{\overline{T_{\text{exec}}^\text{process}}} > \theta$

where $\theta$ is tunable (default: 1.2) (Ding et al., 5 Aug 2025).
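The threshold rule can be sketched as follows. The `ModeTuner` class, the sliding-window size, and the choice of thread mode as the default before both modes have samples are all assumptions of this example rather than documented behavior.

```python
from collections import deque
from statistics import mean


class ModeTuner:
    """Chooses 'thread' or 'process' from recent mean execution times."""

    def __init__(self, theta: float = 1.2, window: int = 20) -> None:
        self.theta = theta
        # Keep only the most recent `window` timings per mode.
        self.times = {
            "thread": deque(maxlen=window),
            "process": deque(maxlen=window),
        }

    def record(self, mode: str, seconds: float) -> None:
        self.times[mode].append(seconds)

    def choose(self) -> str:
        thread, process = self.times["thread"], self.times["process"]
        if not thread or not process:
            return "thread"  # assumed default until both modes have samples
        ratio = mean(thread) / mean(process)
        # Prefer processes only when threads are slower by more than theta.
        return "process" if ratio > self.theta else "thread"


tuner = ModeTuner()
tuner.record("thread", 0.30)
tuner.record("process", 0.20)
mode = tuner.choose()
```

With these made-up timings the ratio is 1.5, which exceeds the default threshold of 1.2, so `mode` is `"process"`.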

Parallel Decoding (SimpleTool): By introducing 17 "mode-selector" tokens and multi-head generation, the function name and arguments are decoded in parallel rather than autoregressively in a single stream. Latency is reduced from

$T_{\mathrm{baseline}} = T_{p} + N \times T_{d}$

to

$T_{\mathrm{ours}} \approx T_{p} + \max_{i}{(N_i)}\,T_{d}$

where $T_p$ is the prefix setup time, $T_d$ is the per-token decoding cost, $N$ is the total number of output tokens, and $N_i$ is the length of the $i$-th output stream. Token compression and parallelization together yield a 3–6× end-to-end speed-up, with a P50 latency of 61 ms (Qwen4B + AWQ4 on RTX 4090) (Shi et al., 4 Feb 2026).
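The two latency expressions can be compared with a small worked example. All numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the cited work, and the sketch ignores token compression.

```python
from typing import Sequence


def baseline_latency(t_prefix: float, n_tokens: int, t_decode: float) -> float:
    """Single-stream decoding: every output token is generated in sequence."""
    return t_prefix + n_tokens * t_decode


def parallel_latency(t_prefix: float, stream_lengths: Sequence[int], t_decode: float) -> float:
    """Parallel streams: the longest stream determines the decode time."""
    return t_prefix + max(stream_lengths) * t_decode


# Hypothetical output streams (function name plus three arguments), in tokens.
streams = [4, 12, 8, 6]
t_p, t_d = 20.0, 2.0  # hypothetical prefix and per-token costs, in ms

base = baseline_latency(t_p, sum(streams), t_d)  # 20 + 30 * 2
par = parallel_latency(t_p, streams, t_d)        # 20 + 12 * 2
speedup = base / par
```

In this toy setting the baseline takes 80 ms and the parallel variant 44 ms; the gain grows as the output is split more evenly across streams.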

Scalability: Batch efficiency remains high (93% in the reported batched setting), with negligible overhead. This enables UniToolCall to serve latency-critical applications, surpassing previous single-stream or retriever-based architectures (Shi et al., 4 Feb 2026).

4. Toolset, Data, and QAOA Evaluation Unification

A distinguishing feature of UniToolCall in the data-centric sense (Liang et al., 13 Apr 2026) is unification across toolset curation, hybrid corpus generation, and evaluation:

Tool Pool: 22,606 tools spanning 6 functional categories and 13 application domains. Tools are filtered by deduplication, schema checks, exclusion of temporal parameters, and cosine similarity over the concatenated name and description.
Datasets: 390,060 training trajectories, including standardized conversions from ten public datasets and structurally controlled synthetic data supporting serial, parallel, single/multi-hop, and multi-turn tool-use.
Interaction Modeling: Structural variants include serial vs. parallel call execution, single-hop ($K = 1$) vs. multi-hop ($K > 1$), and single-turn ($T = 1$) vs. multi-turn ($T > 1$) setups. Anchor Linkage injects prior-turn state into subsequent queries, enforcing long-horizon coherence.

QAOA Evaluation: Seven public benchmarks are converted to QAOA format, supporting call-level, turn-level, and conversation-level precision, parameter accuracy, and semantic fidelity via ROUGE-L, with analogously defined metrics including FP, SPA, and FPA.

This yields fine-grained, comparable performance diagnostics across open and commercial LLMs (Liang et al., 13 Apr 2026).
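The similarity-based filtering applied to the tool pool can be sketched as a greedy near-duplicate filter. The source does not specify the text representation or the cut-off, so the bag-of-words vectors and the 0.9 threshold below are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List


def bow(text: str) -> Counter:
    """Bag-of-words vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def deduplicate(tools: List[Dict[str, str]], threshold: float = 0.9) -> List[Dict[str, str]]:
    """Keep a tool only if it is not too similar to any tool already kept."""
    kept: List[Dict[str, str]] = []
    kept_vecs: List[Counter] = []
    for tool in tools:
        vec = bow(f"{tool['name']} {tool['description']}")
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(tool)
            kept_vecs.append(vec)
    return kept


tools = [
    {"name": "get_weather", "description": "Get the current weather for a city"},
    {"name": "get_weather", "description": "Get the current weather for a city"},
    {"name": "convert_currency", "description": "Convert an amount between two currencies"},
]
unique = deduplicate(tools)
```

A production pipeline would more plausibly use dense sentence embeddings; the greedy structure of the filter would stay the same.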

5. Generative Tool Retrieval and Unified Invocation

Tool selection and retrieval are unified within the generation process by augmenting the LLM's vocabulary with one token per tool (Wang et al., 2024). This architecture, introduced as ToolGen and treated here as part of UniToolCall, lets the model generate tool calls as atomic actions:

Embedding: Each new tool token’s embedding is initialized as the mean of its name’s original subword embeddings.
Training: Multi-stage process including memorization (outputting tool token for API doc), retrieval (choosing the correct tool(s) for a query), and end-to-end agent tuning (with ReAct-style multi-turn data).
Decoding: Trie-based constrained decoding ensures that only valid tool tokens are emitted, eliminating partial/hallucinated tool names.

This paradigm achieves retrieval and invocation performance superior or comparable to prior retriever+generator models and removes the separate retrieval step, along with its latency (Wang et al., 2024).
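Trie-constrained decoding, as mentioned in the list above, can be illustrated with a small token-ID trie. The helper names and token IDs are hypothetical. With exactly one token per tool the trie has depth one and reduces to a vocabulary mask; the sketch handles the more general multi-token case.

```python
from typing import Dict, List, Sequence

END = "<end>"  # sentinel key marking the end of a valid sequence


def build_trie(sequences: Sequence[Sequence[int]]) -> Dict:
    """Build a nested-dict trie over valid token-ID sequences."""
    root: Dict = {}
    for seq in sequences:
        node = root
        for token in seq:
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def allowed_next(trie: Dict, prefix: Sequence[int]) -> List[int]:
    """Return the token IDs that may follow `prefix`; empty if prefix is invalid."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return sorted(t for t in node if t != END)


def pick(logits: Dict[int, float], allowed: List[int]) -> int:
    """Greedy choice restricted to allowed tokens; others are masked out."""
    return max(allowed, key=lambda t: logits.get(t, float("-inf")))


# Hypothetical valid tool-token sequences.
trie = build_trie([[101, 7], [101, 9], [205]])
first = allowed_next(trie, [])
second = allowed_next(trie, [101])
# Token 55 has the highest raw score but is not a valid continuation.
choice = pick({7: 0.1, 9: 2.0, 55: 9.9}, second)
```

Because the argmax runs only over `allowed_next`, an invalid or partial tool name cannot be emitted, whatever the raw scores are.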

6. Decision Evaluation: When (Not) to Call Tools

Tool-use quality depends not only on the correctness of a call but also on the decision to call at all. The When2Call benchmark (Ross et al., 26 Apr 2025) formulates this decision as a 4-way classification:

(a) Direct text answer (no tool call; a hallucination if a tool is needed)
(b) Correct tool call
(c) Follow-up question (if user query under-specifies)
(d) Unable to answer

By mixing real, follow-up, and no-call scenarios, and enforcing evaluation on balanced splits, When2Call exposes overcalling, hallucination, and failure to refuse as endemic issues. Reward-aware Preference Optimization (RPO) training, which uses pairwise ranking and KL-regularized objectives, significantly reduces hallucination rates (to roughly 2%) and increases F1 (to roughly 52% for Mistral-NeMo-Minitron 8B) over naive supervised fine-tuning (Ross et al., 26 Apr 2025).

Key design principles for unified tool-calling systems include explicit separation of when-to-call versus how-to-call, generation of realistic negative samples, and prompt/format consistency (Ross et al., 26 Apr 2025).
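The when-to-call decision can be scored separately from call correctness, as the sketch below suggests. The `Decision` enum mirrors the four options listed above; the `score` function and its "overcall rate" (a tool call predicted where none was expected) are assumptions of this example, not the benchmark's official metrics.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Decision(Enum):
    DIRECT_ANSWER = "direct_answer"
    TOOL_CALL = "tool_call"
    FOLLOW_UP = "follow_up"
    UNABLE = "unable"


def score(pairs: List[Tuple[Decision, Decision]]) -> Dict[str, float]:
    """Score (expected, predicted) decision pairs."""
    correct = sum(expected == predicted for expected, predicted in pairs)
    # Cases in which the expected behavior was NOT a tool call.
    no_call = [(e, p) for e, p in pairs if e != Decision.TOOL_CALL]
    overcalls = sum(p == Decision.TOOL_CALL for _, p in no_call)
    return {
        "accuracy": correct / len(pairs),
        "overcall_rate": overcalls / len(no_call) if no_call else 0.0,
    }


report = score([
    (Decision.TOOL_CALL, Decision.TOOL_CALL),
    (Decision.FOLLOW_UP, Decision.TOOL_CALL),  # model called instead of asking
    (Decision.UNABLE, Decision.UNABLE),
    (Decision.DIRECT_ANSWER, Decision.DIRECT_ANSWER),
])
```

Keeping this decision score apart from argument-level accuracy reflects the when-to-call versus how-to-call separation described above.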

7. Key Contributions, Limitations, and Outlook

UniToolCall as a paradigm delivers:

Canonical abstraction across protocols and representations, enabling registry-based, composable tool integration with minimal code and consistent performance (Ding et al., 5 Aug 2025).
Unified representation of tool-use data (QAOA) enabling granular evaluation, structural diversity modeling, and large-scale, multi-domain pretraining (Liang et al., 13 Apr 2026).
Real-time decoding techniques bridging lab and edge deployment scenarios (Shi et al., 4 Feb 2026).
Direct integration of tool retrieval with generative decoding via vocabulary augmentation (Wang et al., 2024).
Systematic frameworks for decision-level benchmarking and RPO-based training to handle non-trivial tool-use behavior (Ross et al., 26 Apr 2025).

Notable limitations include serialization constraints in process-based execution, reduced generalization to unseen tools in generative models, and potential errors in synthetic QAOA or When2Call data. The overall framework remains extensible: protocol adapters, anchor mechanisms, and schema processors can be added as needed, with future work focusing on adaptive executors, direct MCP support, continuous tool embedding, and on-device real-time function calling.

UniToolCall, as documented and evaluated across these research lines, serves as a foundational standard for the next generation of LLM agent tool-use research.