ToolWeaver: Scalable Tool Use Framework
- ToolWeaver is a generative framework for scalable tool use in LLMs that encodes each tool into hierarchical sequences capturing intrinsic semantics and collaborative relationships.
- It employs logarithmic vocabulary expansion through discrete code sequences, enabling efficient multi-tool reasoning and seamless integration with large language models.
- Its design combines semantic embedding, residual quantization, and collaborative-aware objectives to outperform traditional retrieval-based and generative tool-use pipelines.
ToolWeaver is a generative framework for scalable tool-use in LLMs that encodes each tool as a hierarchical sequence of discrete codes, enabling both efficient vocabulary expansion and the representation of collaborative semantic relationships among tools. Developed to address the dual semantic limitations of retrieval-based and existing generative tool-use pipelines—specifically, the inability to capture intricate intrinsic semantics and co-usage patterns—the ToolWeaver framework achieves logarithmic growth in vocabulary size with respect to the number of supported tools. This structured code approach facilitates more generalizable, efficient, and semantically-aware multi-tool reasoning for advanced AI agents (Fang et al., 29 Jan 2026).
1. Motivation and Semantic Challenges
Prevalent retrieval-based LLM tool-use architectures leverage external retrievers, such as BM25 or dense encoders, to select relevant tools from large libraries. However, these systems are constrained by:
- Under-representation of Intrinsic Semantics: Encoders typically fail to capture the nuanced, functional meaning of tools.
- Absence of Extrinsic Tool Knowledge: LLMs, pretrained solely on natural language corpora, have no a priori understanding of external tool APIs, leading to gaps in multi-tool reasoning and composition.
Standard generative methods that map each tool to a unique atomic token introduce further scalability, generalization, and semantic bottlenecks. As the tool set expands (e.g., the tens of thousands of real-world APIs in ToolBench), vocabulary size increases linearly ($O(N)$ for $N$ tools), with each tool treated as semantically isolated, impeding generalization and collaborative usage modeling. This imposes performance and resource constraints on LLMs and hinders learning of tool relationships.
2. Hierarchical Code Sequence Representation
ToolWeaver replaces the atomic tool tokenization scheme with a sequence of discrete codes for each tool. The code sequence for tool $t$ is defined as

$$c_t = (c_t^1, c_t^2, \ldots, c_t^L),$$

where each $c_t^l \in \{1, \ldots, K\}$ indexes into the $l$-th codebook $C^l$. The total number of new tokens introduced is $L \cdot K$, offering exponential encoding capacity: $K^L \geq N$, where $N$ is the number of tools.
The crucial property of this design is logarithmic vocabulary expansion:

$$L = \lceil \log_K N \rceil,$$

so the number of additional tokens scales as $O(K \log_K N) = O(\log N)$. This stands in contrast to the $O(N)$ growth of monolithic token schemes.
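The scaling contrast can be made concrete with a few lines of arithmetic. In the sketch below, the codebook size $K = 256$ is an illustrative assumption, not a value from the paper:

```python
import math

def hierarchical_vocab_size(num_tools: int, codebook_size: int) -> tuple[int, int]:
    """Return (levels, new_tokens) for a hierarchical code scheme:
    L = ceil(log_K N) levels, each contributing K new tokens."""
    levels = max(1, math.ceil(math.log(num_tools, codebook_size)))
    return levels, levels * codebook_size

# Illustrative comparison against atomic (one-token-per-tool) schemes.
for n in (1_000, 10_000, 100_000):
    levels, new_tokens = hierarchical_vocab_size(n, 256)
    print(f"N={n:>7}: atomic adds {n} tokens; "
          f"hierarchical adds {new_tokens} ({levels} levels x 256 codes)")
```

Even at 100,000 tools, the hierarchical scheme adds only a few hundred tokens, while the atomic scheme adds one token per tool.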
By embedding both intrinsic semantics (functional similarity) and extrinsic co-usage patterns (collaborative relationships captured via cosine similarities on tool usage), ToolWeaver encodes tools such that jointly-used tools share early code prefixes, directly supporting multi-tool reasoning and efficient generalization.
3. Collaborative-Aware Structured Tokenization
The code sequence assignment follows an explicit collaborative-aware residual quantization procedure:
- Semantic Embedding: For each tool $t$, a text encoder computes an embedding $e_t \in \mathbb{R}^D$ from its documentation.
- Projection: $z_t = W e_t$, projecting the embedding into a lower-dimensional space.
- Residual Initialization: $r_t^0 = z_t$.
- Codebook Quantization: For each level $l = 1, \ldots, L$:
  - Assign code index: $c_t^l = \arg\min_k \| r_t^{l-1} - C_k^l \|^2$
  - Update residual: $r_t^l = r_t^{l-1} - C_{c_t^l}^l$
- Collaborative-Aware Objective: Optimize codebooks to minimize
$$\mathcal{L} = \sum_t \sum_{l=1}^{L} \| r_t^{l-1} - C_{c_t^l}^l \|^2 + \lambda \, \mathcal{L}_{\text{collab}},$$
with
$$\mathcal{L}_{\text{collab}} = \sum_{t, t'} S_{t,t'} \, \| \hat{z}_t - \hat{z}_{t'} \|^2, \qquad \hat{z}_t = \sum_{l=1}^{L} C_{c_t^l}^l,$$
where $S_{t,t'}$ encodes normalized co-occurrence, pulling frequently co-used tools toward shared code prefixes.
- Conflict Mitigation: At the final codebook level, a balanced assignment via the Sinkhorn–Knopp algorithm ensures uniform tool distribution among code indices, averting index collisions.
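The per-tool quantization steps above can be sketched as follows. This minimal version assumes random codebooks with illustrative sizes ($L = 3$ levels, $K = 8$ codes) and omits the collaborative objective and the balanced final-level assignment:

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization: at each level, pick the nearest codeword
    and subtract it, so code prefixes capture coarse-to-fine structure.

    z:         (d,) projected tool embedding
    codebooks: list of (K, d) arrays, one per level
    Returns the code sequence and the final residual.
    """
    codes, residual = [], z.copy()
    for C in codebooks:
        idx = int(np.argmin(((residual - C) ** 2).sum(axis=1)))  # nearest codeword
        codes.append(idx)
        residual = residual - C[idx]                             # update residual
    return codes, residual

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # L=3 levels, K=8 codes each
z = rng.normal(size=4)
codes, res = residual_quantize(z, codebooks)
print(codes, float(np.linalg.norm(res)))
```

In the full method, the codebooks themselves are optimized under the collaborative-aware objective rather than fixed.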
This process tightly weaves semantic and collaborative signals into tool code assignments.
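The balanced final-level assignment can be sketched with alternating row/column normalization, the core of Sinkhorn-Knopp. The affinity matrix, sizes ($N = 32$ tools, $K = 4$ codes), and temperature below are illustrative assumptions:

```python
import numpy as np

def sinkhorn_balanced_assign(scores, n_iters=50, temp=0.1):
    """Sinkhorn-Knopp normalization of a tool-to-code affinity matrix so each
    of the N tools is (softly) assigned to K codes with near-uniform load.

    scores: (N, K) similarity of each tool's residual to each final-level code.
    Returns hard assignments taken from the doubly-normalized matrix.
    """
    n, k = scores.shape
    P = np.exp(scores / temp)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)                  # rows: each tool sums to 1
        P /= P.sum(axis=0, keepdims=True) / (n / k)        # columns: near-uniform load
    return P.argmax(axis=1)

rng = np.random.default_rng(1)
scores = rng.normal(size=(32, 4))
assign = sinkhorn_balanced_assign(scores)
print(np.bincount(assign, minlength=4))  # per-code tool counts
```

Taking the argmax of the doubly-normalized matrix discourages, though does not strictly guarantee, collisions where many tools share one final code.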
4. Generative Alignment and Model Integration
Each code is materialized as a new special token whose embedding is randomly initialized and fine-tuned during alignment. The alignment is conducted in two stages:
- Tool Retrieval Alignment: The LLM minimizes the negative log-likelihood of code sequences conditioned on tool queries $q$:
$$\mathcal{L}_{\text{retr}} = - \sum_{(q,\, t)} \sum_{l=1}^{L} \log P_\theta\!\left(c_t^l \mid c_t^{<l}, q\right)$$
- Tool Usage Trajectory Alignment: Training continues on full multi-step tool usage trajectories $y = (y_1, \ldots, y_T)$ via autoregressive cross-entropy loss:
$$\mathcal{L}_{\text{traj}} = - \sum_{i=1}^{T} \log P_\theta\!\left(y_i \mid y_{<i}\right)$$
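The retrieval-alignment stage reduces to a teacher-forced negative log-likelihood over code levels. The sketch below assumes per-level logits are already available (in practice they come from the LLM's output head):

```python
import numpy as np

def code_sequence_nll(step_logits, gold_codes):
    """Teacher-forced NLL of a gold code sequence: sum of per-level
    cross-entropies, one row of logits per code level."""
    nll = 0.0
    for logits, c in zip(step_logits, gold_codes):
        log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
        nll -= log_probs[c]
    return nll

# Uniform logits over K=8 codes at each of L=3 levels -> NLL = 3 * log 8.
print(code_sequence_nll([np.zeros(8)] * 3, [0, 1, 2]))
```

Because the loss factorizes over levels, a wrong early code cannot be corrected at later levels, which motivates the trie-constrained decoding used at inference.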
During inference, beam search with trie constraints efficiently restricts decoding to valid code sequences, ensuring only legitimate tool identifiers are generated.
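The trie constraint can be illustrated with a greedy stand-in for beam search; `score_fn` here is a hypothetical placeholder for the LLM's next-token score, and the three-tool code table is made up for the example:

```python
def build_code_trie(tool_codes):
    """Map each valid code-sequence prefix to the set of allowed next codes."""
    allowed = {}
    for seq in tool_codes:
        for i in range(len(seq)):
            allowed.setdefault(tuple(seq[:i]), set()).add(seq[i])
    return allowed

def constrained_greedy_decode(score_fn, allowed, length):
    """Greedy decoding restricted to prefixes of registered tool codes,
    so only legitimate tool identifiers can be produced."""
    prefix = ()
    for _ in range(length):
        candidates = allowed[prefix]                        # only legal continuations
        best = max(candidates, key=lambda c: score_fn(prefix, c))
        prefix = prefix + (best,)
    return prefix

tools = {(0, 3, 1), (0, 3, 2), (1, 0, 2)}
allowed = build_code_trie(tools)
# Toy scorer that always prefers higher code indices.
decoded = constrained_greedy_decode(lambda p, c: c, allowed, 3)
print(decoded)  # a valid tool identifier from the table
```

Real beam search keeps several prefixes alive per step, but the pruning rule is the same: any continuation absent from the trie is masked out.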
5. Empirical Performance and Analysis
ToolWeaver was evaluated on ToolBench, a large collection of real-world APIs, using standard splits: I1 (single-tool), I2 (multi-tool/one category), I3 (multi-tool/multi-category), and generalization splits ("Tool.", "Cat.").
Retrieval Efficacy
| Method | I1@1 | I3@1 | I3@5 |
|---|---|---|---|
| BM25 | 26.92 | 10.00 | 12.33 |
| EmbSim | 50.50 | 18.00 | 20.94 |
| ToolRetriever | 75.92 | 28.00 | 44.54 |
| ToolGen | 88.50 | 81.00 | 85.83 |
| ToolWeaver | 91.16 | 88.00 | 90.12 |
ToolWeaver substantially outperforms all retrieval and generative baselines in NDCG@k, particularly for complex multi-tool queries (I3).
End-to-End Tool Use
| Method | I3 SoPR | I3 SoWR |
|---|---|---|
| ToolGen | 36.34 | 45.56 |
| ToolWeaver | 52.19 | 59.02 |
On end-to-end metrics, ToolWeaver's solvable pass rate and win rate notably exceed those of prior work, especially on tasks requiring cross-category multi-tool composition.
Ablation and Tokenization Comparisons
Ablations confirm the necessity of both semantic initialization and collaborative-aware code assignment, with the collaborative weight tuned to maximize NDCG. Static-tree, atomic, numerical, and semantics-only tokenizations all underperform the full collaborative approach.
Language Modeling Preservation
| Model | WikiText-2 PPL | CNN/DM BERTScore |
|---|---|---|
| Llama-3-8B | 6.34 | 0.8535 |
| ToolGen | 104.54 | 0.8293 |
| ToolWeaver | 25.36 | 0.8507 |
Unlike linear vocabulary expansion, which degrades perplexity and summarization quality, ToolWeaver's compact approach notably preserves core language modeling capabilities.
Inference Efficiency
Despite emitting multiple tokens per tool identifier, inference on an A100 remains fast per call and memory efficient (15.1 GB vs. 15.8 GB for ToolGen).
6. Limitations and Future Directions
Potential limitations include increased autoregressive error rates for longer code sequences (larger $L$), due to error propagation during generation. The residual quantization process is currently unsupervised; incorporating reinforcement learning from real tool-use feedback is proposed as a future enhancement for refining code assignments. A plausible implication is that supervised or RL-based code adaptation could further improve generalization and downstream agent performance.
ToolWeaver establishes a hierarchical, collaborative-aware tokenization paradigm for tool-augmented LLMs, supporting scalable addition of external functionalities while preserving semantic structure and language modeling competence. Code and data are available at https://github.com/Fwibo/ToolWeaver (Fang et al., 29 Jan 2026).