ToolWeaver: Scalable Tool Use Framework
- ToolWeaver is a generative framework for scalable tool use in LLMs that encodes each tool into hierarchical sequences capturing intrinsic semantics and collaborative relationships.
- It employs logarithmic vocabulary expansion through discrete code sequences, enabling efficient multi-tool reasoning and seamless integration with large language models.
- Its design combines semantic embedding, residual quantization, and collaborative-aware objectives to outperform traditional retrieval-based and generative tool-use pipelines.
ToolWeaver is a generative framework for scalable tool-use in LLMs that encodes each tool as a hierarchical sequence of discrete codes, enabling both efficient vocabulary expansion and the representation of collaborative semantic relationships among tools. Developed to address the dual semantic limitations of retrieval-based and existing generative tool-use pipelines—specifically, the inability to capture intricate intrinsic semantics and co-usage patterns—the ToolWeaver framework achieves logarithmic growth in vocabulary size with respect to the number of supported tools. This structured code approach facilitates more generalizable, efficient, and semantically-aware multi-tool reasoning for advanced AI agents (Fang et al., 29 Jan 2026).
1. Motivation and Semantic Challenges
Prevalent retrieval-based LLM tool-use architectures leverage external retrievers, such as BM25 or dense encoders, to select relevant tools from large libraries. However, these systems are constrained by:
- Under-representation of Intrinsic Semantics: Encoders typically fail to capture the nuanced, functional meaning of tools.
- Absence of Extrinsic Tool Knowledge: LLMs, pretrained solely on natural language corpora, have no a priori understanding of external tool APIs, leading to gaps in multi-tool reasoning and composition.
Standard generative methods that map each tool to a unique atomic token introduce further scalability, generalization, and semantic bottlenecks. As the tool set expands (e.g., the tens of thousands of real-world APIs in ToolBench), vocabulary size increases linearly ($O(N)$ for $N$ tools), with each tool treated as semantically isolated, impeding generalization and collaborative usage modeling. This imposes performance and resource constraints on LLMs and hinders learning of tool relationships.
2. Hierarchical Code Sequence Representation
ToolWeaver replaces the atomic tool tokenization scheme with a sequence of discrete codes for each tool. The code sequence for tool $t$ is defined as

$$c_t = (c_t^1, c_t^2, \ldots, c_t^L),$$

where each $c_t^l \in \{1, \ldots, K\}$ indexes into the $l$-th codebook $C^l$. The total number of new tokens introduced is $L \cdot K$, offering exponential encoding capacity: $K^L \geq N$, where $N$ is the number of tools.
The crucial property of this design is logarithmic vocabulary expansion:

$$L = \lceil \log_K N \rceil,$$

so the number of additional tokens scales as $O(K \log_K N) = O(\log N)$. This stands in contrast to the $O(N)$ growth of monolithic token schemes.
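The scaling contrast can be made concrete with a few lines of arithmetic. In the sketch below, the codebook size $K = 256$ is an illustrative assumption, not a value from the paper:

```python
import math

def hierarchical_vocab_size(num_tools: int, codebook_size: int) -> tuple[int, int]:
    """Return (levels, new_tokens) for a hierarchical code scheme:
    L = ceil(log_K N) levels, each contributing K new tokens."""
    levels = max(1, math.ceil(math.log(num_tools, codebook_size)))
    return levels, levels * codebook_size

# Illustrative comparison against atomic (one-token-per-tool) schemes.
for n in (1_000, 10_000, 100_000):
    levels, new_tokens = hierarchical_vocab_size(n, 256)
    print(f"N={n:>7}: atomic adds {n} tokens; "
          f"hierarchical adds {new_tokens} ({levels} levels x 256 codes)")
```

Even at 100,000 tools, the hierarchical scheme adds only a few hundred tokens, while the atomic scheme adds one token per tool.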
By embedding both intrinsic semantics (functional similarity) and extrinsic co-usage patterns (collaborative relationships captured via cosine similarities on tool usage), ToolWeaver encodes tools such that jointly-used tools share early code prefixes, directly supporting multi-tool reasoning and efficient generalization.
3. Collaborative-Aware Structured Tokenization
The code sequence assignment follows an explicit collaborative-aware residual quantization procedure:
- Semantic Embedding: For each tool $t$, a text encoder computes an embedding $e_t \in \mathbb{R}^D$ from its documentation.
- Projection: $z_t = W e_t$, projecting the embedding into a lower-dimensional space.
- Residual Initialization: $r_t^0 = z_t$.
- Codebook Quantization: For each level $l = 1, \ldots, L$:
  - Assign code index: $c_t^l = \arg\min_k \| r_t^{l-1} - C_k^l \|^2$
  - Update residual: $r_t^l = r_t^{l-1} - C_{c_t^l}^l$
- Collaborative-Aware Objective: Optimize codebooks to minimize
$$\mathcal{L} = \sum_t \sum_{l=1}^{L} \| r_t^{l-1} - C_{c_t^l}^l \|^2 + \lambda \, \mathcal{L}_{\text{collab}},$$
with
$$\mathcal{L}_{\text{collab}} = \sum_{t, t'} S_{t,t'} \, \| \hat{z}_t - \hat{z}_{t'} \|^2, \qquad \hat{z}_t = \sum_{l=1}^{L} C_{c_t^l}^l,$$
where $S_{t,t'}$ encodes normalized co-occurrence, pulling frequently co-used tools toward shared code prefixes.
- Conflict Mitigation: At the final codebook level, a balanced assignment via the Sinkhorn–Knopp algorithm ensures uniform tool distribution among code indices, averting index collisions.
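The per-tool quantization steps above can be sketched as follows. This minimal version assumes random codebooks with illustrative sizes ($L = 3$ levels, $K = 8$ codes) and omits the collaborative objective and the balanced final-level assignment:

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization: at each level, pick the nearest codeword
    and subtract it, so code prefixes capture coarse-to-fine structure.

    z:         (d,) projected tool embedding
    codebooks: list of (K, d) arrays, one per level
    Returns the code sequence and the final residual.
    """
    codes, residual = [], z.copy()
    for C in codebooks:
        idx = int(np.argmin(((residual - C) ** 2).sum(axis=1)))  # nearest codeword
        codes.append(idx)
        residual = residual - C[idx]                             # update residual
    return codes, residual

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # L=3 levels, K=8 codes each
z = rng.normal(size=4)
codes, res = residual_quantize(z, codebooks)
print(codes, float(np.linalg.norm(res)))
```

In the full method, the codebooks themselves are optimized under the collaborative-aware objective rather than fixed.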
This process tightly weaves semantic and collaborative signals into tool code assignments.
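The balanced final-level assignment can be sketched with alternating row/column normalization, the core of Sinkhorn-Knopp. The affinity matrix, sizes ($N = 32$ tools, $K = 4$ codes), and temperature below are illustrative assumptions:

```python
import numpy as np

def sinkhorn_balanced_assign(scores, n_iters=50, temp=0.1):
    """Sinkhorn-Knopp normalization of a tool-to-code affinity matrix so each
    of the N tools is (softly) assigned to K codes with near-uniform load.

    scores: (N, K) similarity of each tool's residual to each final-level code.
    Returns hard assignments taken from the doubly-normalized matrix.
    """
    n, k = scores.shape
    P = np.exp(scores / temp)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)                  # rows: each tool sums to 1
        P /= P.sum(axis=0, keepdims=True) / (n / k)        # columns: near-uniform load
    return P.argmax(axis=1)

rng = np.random.default_rng(1)
scores = rng.normal(size=(32, 4))
assign = sinkhorn_balanced_assign(scores)
print(np.bincount(assign, minlength=4))  # per-code tool counts
```

Taking the argmax of the doubly-normalized matrix discourages, though does not strictly guarantee, collisions where many tools share one final code.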
4. Generative Alignment and Model Integration
Each code is materialized as a new special token whose embedding is randomly initialized and fine-tuned during alignment. The alignment is conducted in two stages:
- Tool Retrieval Alignment: The LLM minimizes the negative log-likelihood of code sequences conditioned on tool queries $q$:
$$\mathcal{L}_{\text{retr}} = - \sum_{(q,\, t)} \sum_{l=1}^{L} \log P_\theta\!\left(c_t^l \mid c_t^{<l}, q\right)$$
- Tool Usage Trajectory Alignment: Training continues on full multi-step tool usage trajectories $y = (y_1, \ldots, y_T)$ via autoregressive cross-entropy loss:
$$\mathcal{L}_{\text{traj}} = - \sum_{i=1}^{T} \log P_\theta\!\left(y_i \mid y_{<i}\right)$$
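The retrieval-alignment stage reduces to a teacher-forced negative log-likelihood over code levels. The sketch below assumes per-level logits are already available (in practice they come from the LLM's output head):

```python
import numpy as np

def code_sequence_nll(step_logits, gold_codes):
    """Teacher-forced NLL of a gold code sequence: sum of per-level
    cross-entropies, one row of logits per code level."""
    nll = 0.0
    for logits, c in zip(step_logits, gold_codes):
        log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
        nll -= log_probs[c]
    return nll

# Uniform logits over K=8 codes at each of L=3 levels -> NLL = 3 * log 8.
print(code_sequence_nll([np.zeros(8)] * 3, [0, 1, 2]))
```

Because the loss factorizes over levels, a wrong early code cannot be corrected at later levels, which motivates the trie-constrained decoding used at inference.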
During inference, beam search with trie constraints efficiently restricts decoding to valid code sequences, ensuring only legitimate tool identifiers are generated.
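The trie constraint can be illustrated with a greedy stand-in for beam search; `score_fn` here is a hypothetical placeholder for the LLM's next-token score, and the three-tool code table is made up for the example:

```python
def build_code_trie(tool_codes):
    """Map each valid code-sequence prefix to the set of allowed next codes."""
    allowed = {}
    for seq in tool_codes:
        for i in range(len(seq)):
            allowed.setdefault(tuple(seq[:i]), set()).add(seq[i])
    return allowed

def constrained_greedy_decode(score_fn, allowed, length):
    """Greedy decoding restricted to prefixes of registered tool codes,
    so only legitimate tool identifiers can be produced."""
    prefix = ()
    for _ in range(length):
        candidates = allowed[prefix]                        # only legal continuations
        best = max(candidates, key=lambda c: score_fn(prefix, c))
        prefix = prefix + (best,)
    return prefix

tools = {(0, 3, 1), (0, 3, 2), (1, 0, 2)}
allowed = build_code_trie(tools)
# Toy scorer that always prefers higher code indices.
decoded = constrained_greedy_decode(lambda p, c: c, allowed, 3)
print(decoded)  # a valid tool identifier from the table
```

Real beam search keeps several prefixes alive per step, but the pruning rule is the same: any continuation absent from the trie is masked out.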
5. Empirical Performance and Analysis
ToolWeaver was evaluated on ToolBench, a large collection of real-world APIs, using standard splits: I1 (single-tool), I2 (multi-tool/one category), I3 (multi-tool/multi-category), and generalization splits ("Tool.", "Cat.").
Retrieval Efficacy
| Method | I1@1 | I3@1 | I3@5 |
|---|---|---|---|
| BM25 | 26.92 | 10.00 | 12.33 |
| EmbSim | 50.50 | 18.00 | 20.94 |
| ToolRetriever | 75.92 | 28.00 | 44.54 |
| ToolGen | 88.50 | 81.00 | 85.83 |
| ToolWeaver | 91.16 | 88.00 | 90.12 |
ToolWeaver substantially outperforms all retrieval and generative baselines in NDCG@k, particularly for complex multi-tool queries (I3).
End-to-End Tool Use
| Method | I3 SoPR | I3 SoWR |
|---|---|---|
| ToolGen | 36.34 | 45.56 |
| ToolWeaver | 52.19 | 59.02 |
On end-to-end metrics, ToolWeaver's solvable pass rate and win rate notably exceed those of prior work, especially on tasks requiring cross-category multi-tool composition.
Ablation and Tokenization Comparisons
Ablations confirm the necessity of both semantic initialization and collaborative-aware code assignment, with the collaborative weight tuned to maximize NDCG. Static-tree, atomic, numerical, and semantics-only tokenizations all underperform the full collaborative approach.
Language Modeling Preservation
| Model | WikiText-2 PPL | CNN/DM BERTScore |
|---|---|---|
| Llama-3-8B | 6.34 | 0.8535 |
| ToolGen | 104.54 | 0.8293 |
| ToolWeaver | 25.36 | 0.8507 |
Unlike linear vocabulary expansion, which degrades perplexity and summarization quality, ToolWeaver's compact approach notably preserves core language modeling capabilities.
Inference Efficiency
Despite emitting multiple tokens per tool identifier, inference on an A100 remains fast per call and memory efficient (15.1 GB vs. 15.8 GB for ToolGen).
6. Limitations and Future Directions
Potential limitations include increased autoregressive error rates for longer code sequences (larger $L$), due to error propagation during generation. The residual quantization process is currently unsupervised; incorporating reinforcement learning from real tool-use feedback is proposed as a future enhancement for refining code assignments. A plausible implication is that supervised or RL-based code adaptation could further improve generalization and downstream agent performance.
ToolWeaver establishes a hierarchical, collaborative-aware tokenization paradigm for tool-augmented LLMs, supporting scalable addition of external functionalities while preserving semantic structure and language modeling competence. Code and data are available at https://github.com/Fwibo/ToolWeaver (Fang et al., 29 Jan 2026).