ToolAlpaca-13B: Compact LLM for Generalized Tool Use
- ToolAlpaca-13B is a 13-billion-parameter language model trained on data from a multi-agent simulation pipeline, reaching near-GPT-3.5 tool-use proficiency.
- The model is fine-tuned on a diverse, automatically generated API corpus with minimal human annotation, using a ReAct-style framework for tool interactions.
- Empirical results highlight its high function and argument accuracy on unseen APIs, demonstrating effective generalization with modest training data.
ToolAlpaca-13B is a 13-billion-parameter LLM that demonstrates strong generalized tool-use capabilities via an LLM-driven data-simulation and fine-tuning pipeline. Unlike prior approaches that either require extremely large models (such as GPT-4) for zero-shot tool use or rely on tool-specific supervision in compact models, ToolAlpaca-13B achieves near-GPT-3.5 proficiency in generalized tool learning by leveraging a highly diverse, automatically generated corpus, with minimal human annotation and no architectural changes to the base model. Its results suggest that with sufficient task diversity and corpus engineering, compact autoregressive Transformers can acquire broad tool-use abilities and generalize to previously unseen APIs (Tang et al., 2023).
1. Multi-Agent Corpus Generation via Simulation
ToolAlpaca-13B's foundation is a simulated tool-use corpus, generated with a multi-agent pipeline operated by off-the-shelf LLMs under strict prompt templates. The simulation includes three agent roles:
- User-Agent: Given a tool's structured documentation (name, introduction, descriptions, function-level docs, OpenAPI spec), drafts diverse instructions (e.g., “List all public holidays in Japan in 2024”) and responds to clarifying queries from the assistant.
- Assistant-Agent: Follows a ReAct-style pattern, alternating Thought, Action (tool call), and Observation steps, with each action formatted precisely as "Action name: <function_name> Input: <JSON arguments>", until a final Response is produced.
- Tool-Executor Agent: Receives the assistant's action and the tool's OpenAPI spec, and generates HTTP-style simulated responses, including status codes and JSON payloads.
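Each tool-use instance is collected as a triple: the user instruction, the sequence of assistant actions (with their observations), and the final response.
The User-Agent and Tool-Executor Agent are both driven by ChatGPT, while the Assistant-Agent runs on GPT-3.5, maintaining strict output-format consistency.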
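The three-role loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ToolUseInstance` fields, the `simulate_instance` helper, the stub agents, and the `getPublicHolidays` function name are all hypothetical, and the stubs stand in for the LLM calls that drive each role in the real pipeline.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# One recorded step: (function name, JSON arguments, simulated observation).
Step = Tuple[str, Dict, str]


@dataclass
class ToolUseInstance:
    """Hypothetical container for one instance: instruction, actions, response."""
    instruction: str
    actions: List[Step] = field(default_factory=list)
    response: str = ""


def simulate_instance(
    instruction: str,
    assistant: Callable[[str, List[Step]], Dict],
    tool_executor: Callable[[str, Dict], str],
    max_steps: int = 5,
) -> ToolUseInstance:
    """Alternate assistant actions and simulated observations until a response."""
    instance = ToolUseInstance(instruction)
    for _ in range(max_steps):
        step = assistant(instruction, instance.actions)
        if "response" in step:  # the assistant decided it can answer the user
            instance.response = step["response"]
            break
        observation = tool_executor(step["action"], step["input"])
        instance.actions.append((step["action"], step["input"], observation))
    return instance


def stub_assistant(instruction: str, history: List[Step]) -> Dict:
    """Stand-in for the assistant LLM: one tool call, then a final answer."""
    if not history:
        return {"action": "getPublicHolidays",
                "input": {"country": "JP", "year": 2024}}
    return {"response": "Found holidays: " + history[-1][2]}


def stub_tool_executor(action: str, arguments: Dict) -> str:
    """Stand-in for the tool-executor LLM: returns an HTTP-style JSON payload."""
    payload = {"status": 200,
               "data": [{"name": "New Year's Day", "date": "2024-01-01"}]}
    return json.dumps(payload)


instance = simulate_instance(
    "List all public holidays in Japan in 2024",
    stub_assistant,
    stub_tool_executor,
)
```

Passing the agents in as callables keeps the loop independent of any particular LLM provider; the `max_steps` cap is an added safeguard against traces that never terminate and is not something the source specifies.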
2. Scale, Diversity, and Characteristics of the Tool-Use Corpus
Corpus construction starts by sampling 500 APIs from the "public-apis" GitHub repository (approximately 1,400 APIs across 50+ categories) and keeping only those that are valid, text-based, and parseable. The final dataset comprises 426 unique tools, each assigned ten user instructions generated by the User-Agent. Tool-use traces are collected through the assistant–tool-executor loop, producing a total corpus of 3,938 instances with the following characteristics:
| Statistic | Value |
|---|---|
| Number of Tool Categories | 50 |
| Number of Unique Tools | 426 |
| Total Instances | 3,938 |
| Single-function-call Instances | 2,512 |
| Multi-function-call Instances | 1,426 |
| Avg. Functions per Tool | 4.85 |
| Avg. Steps per Instance | 1.66 |
| Avg. Instruction Length (tokens) | 23.4 |
| Avg. Output Length (tokens) | 36.2 |
Although modest in size, this highly diverse corpus covers a broad spectrum of real-world tool APIs, including the Calendar, Transportation, Weather, Finance, Entertainment, and Multi-Modal categories.
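As a quick consistency check on the statistics above, the following sketch derives a few ratios from the reported counts. Only the three counts and the tool total come from the table; the `corpus_ratios` helper and its key names are illustrative.

```python
def corpus_ratios(total: int, single_call: int, multi_call: int, tools: int) -> dict:
    """Derive simple shares from the reported corpus counts."""
    if single_call + multi_call != total:
        raise ValueError("single- and multi-call counts do not sum to the total")
    return {
        "single_call_share": single_call / total,
        "multi_call_share": multi_call / total,
        "instances_per_tool": total / tools,
    }


ratios = corpus_ratios(total=3938, single_call=2512, multi_call=1426, tools=426)
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

The single- and multi-call counts sum exactly to the total, with roughly 64% of instances needing one call. The average of about 9.2 instances per tool is below the ten instructions drafted per tool, which suggests that some generated traces were discarded before the final corpus was assembled.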
3. Fine-Tuning Procedure and Model Architecture
ToolAlpaca-13B uses Vicuna-13B (itself fine-tuned from LLaMA-13B) as its base, retaining the standard 13-billion-parameter autoregressive Transformer architecture with no structural changes.
Fine-Tuning Regimen:
- Data: Complete ToolAlpaca corpus (3,938 instances)
- Hyperparameters:
  - Optimizer: AdamW
  - Learning rate schedule: linear warmup for 3% of steps, then cosine decay
  - Batch size: 128
  - Weight decay: 0.0
  - Epochs: 3
  - Max sequence length: 2048 tokens
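- Training Objective: Standard next-token log-likelihood, $\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$, maximized over the training set, where $x = (x_1, \dots, x_T)$ is the concatenated token sequence of user and assistant turns.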
- No auxiliary losses or value heads are used.
- Prompt format enforces clear delineation of roles and ReAct-style tool calls during supervision.
At inference, ToolAlpaca-13B is prompted identically and expected to emit tool calls and responses in the same ReAct-based format as observed during fine-tuning.
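The warmup-then-cosine schedule listed in the fine-tuning regimen can be sketched as below. The peak learning rate is left as a parameter because its value is not stated here, and the step count assumes one optimizer step per batch of 128 with the final partial batch kept; both the `lr_at_step` helper and these assumptions are illustrative rather than taken from the training code.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = max(1, math.ceil(warmup_frac * total_steps))
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Step budget implied by the listed settings (assumption: partial batch kept).
steps_per_epoch = math.ceil(3938 / 128)
total_steps = steps_per_epoch * 3
warmup_steps = max(1, math.ceil(0.03 * total_steps))

# peak_lr=1.0 is a placeholder, so the values read as fractions of the peak.
schedule = [lr_at_step(s, total_steps, peak_lr=1.0) for s in range(total_steps)]
```

Under these assumptions the whole run is fewer than a hundred optimizer steps, so the 3% warmup lasts only a handful of updates.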
4. Zero-Shot Evaluation on Unseen APIs
ToolAlpaca-13B’s generalization is benchmarked on subsets of APIs and external toolsets not seen during training:
- Simulated Tools Subset: 10 APIs, 100 instances (generated and manually verified)
- Real-World APIs Subset: 11 external APIs (e.g., Nager.Date, airportsapi, weatherstack), 114 instances
- GPT4Tools Set: 8 multi-modal tools from GPT4Tools benchmark, 652 instances (after filtering)
Evaluation Metrics:
- Procedure Accuracy: Correct selection of function(s) and parameters.
- Argument Accuracy: Well-formed JSON arguments (types, values).
- Overall Success Rate (SR): Both of the above are correct and the final answer satisfies the user.
Machine evaluation uses GPT-4 as the judge; for the simulated subset, a human acceptance rate is also collected.
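Once a judge has labeled each instance, the three metrics reduce to simple counting. The sketch below assumes per-instance boolean verdicts; the `Judgment` record, its field names, and the `score` helper are hypothetical and do not reproduce the GPT-4 judging prompt or its output format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Judgment:
    """Hypothetical per-instance verdicts from a judge (machine or human)."""
    procedure_ok: bool   # right function(s) and parameters were selected
    arguments_ok: bool   # JSON arguments were well formed
    answer_ok: bool      # final answer satisfied the user


def score(judgments: Iterable[Judgment]) -> Dict[str, float]:
    """Aggregate verdicts into procedure, argument, and overall rates."""
    items = list(judgments)
    if not items:
        raise ValueError("no judgments to score")
    n = len(items)
    return {
        "procedure": sum(j.procedure_ok for j in items) / n,
        "arguments": sum(j.arguments_ok for j in items) / n,
        # Overall success requires every criterion to hold on the same instance.
        "overall": sum(
            j.procedure_ok and j.arguments_ok and j.answer_ok for j in items
        ) / n,
    }


example = [
    Judgment(True, True, True),
    Judgment(True, True, False),
    Judgment(True, False, False),
    Judgment(False, False, False),
]
rates = score(example)
```

Because the overall rate is a conjunction over the same instance, it can never exceed either of the component rates.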
Quantitative Results
| Model | Simulated SR | Simulated Human | Real-World SR |
|---|---|---|---|
| GPT-3.5 | 75.0% | 79.0% | 72.8% |
| Vicuna-13B (zero-shot) | 16.0% | 25.0% | 12.3% |
| ToolAlpaca-13B | 70.0% | 75.0% | 61.4% |
For the GPT4Tools subset:
| Model | SR_t (thought) | SR_act (action) | SR_args (arguments) | SR (overall) |
|---|---|---|---|---|
| GPT-3.5 | 99.5% | 99.5% | 91.5% | 91.5% |
| Vicuna-13B (zero-shot) | 84.4% | 43.7% | 46.7% | 26.2% |
| GPT4Tools (71K cases) | 98.2% | 97.0% | 92.2% | 90.6% |
| ToolAlpaca-13B (3.9K) | — | 95.5% | 85.3% | 83.7% |
With only 3,938 simulated examples, ToolAlpaca-13B approaches the GPT-3.5 baseline and achieves 83.7% overall success on the out-of-dataset GPT4Tools benchmark, within about 7 points of the GPT4Tools model (90.6%), which was trained on roughly 18 times more data (71K cases).
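To make the percentages above more concrete, the sketch below converts the real-world success rates into implied instance counts and computes the training-data ratio against GPT4Tools. The counts are inferred by rounding and are not reported in the source; "71K" is treated as exactly 71,000 for illustration.

```python
def implied_successes(rate_percent: float, n_instances: int) -> int:
    """Nearest whole number of successful instances for a reported rate."""
    return round(rate_percent / 100 * n_instances)


REAL_WORLD_INSTANCES = 114
real_world_sr = {"GPT-3.5": 72.8, "Vicuna-13B": 12.3, "ToolAlpaca-13B": 61.4}

counts = {
    model: implied_successes(rate, REAL_WORLD_INSTANCES)
    for model, rate in real_world_sr.items()
}

# Check that each inferred count reproduces the reported one-decimal rate.
round_trip = {
    model: round(100 * c / REAL_WORLD_INSTANCES, 1) for model, c in counts.items()
}

# Training-set size ratio (assumption: 71K means 71,000 cases).
data_ratio = 71_000 / 3_938
```

Read this way, the 11.4-point gap to GPT-3.5 on the real-world subset corresponds to 13 instances out of 114.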
5. Empirical Analyses: Ablations and Insights
Tool Diversity
Controlling for instance count (3,938), varying the number of distinct tools during training yields:
- 10 tools → 51% overall accuracy
- 40 tools → ≈58%
- 100 tools → ≈64%
- 400 tools → 70%
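Accuracy rises steadily with the number of distinct tools even though the amount of training data is held fixed, which points to API diversity as the primary driver of generalization, with model scale contributing secondarily.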
Model Scale
Under identical training, the 13B variant of ToolAlpaca consistently outperforms the 7B variant by 5–10 points on the simulated and real-world benchmarks. The added capacity appears to pay off mainly when it is paired with sufficient tool diversity.
Data Quality
Human review (100 instances) reports:
- Instruction solvability: 88%
- Tool executor response correctness: 92%
- Assistant action and final response accuracy: 80%
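These figures support the claim that the multi-agent simulation pipeline produces realistic tool-use traces of generally high quality, while also showing that about one in five assistant traces contains an error.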
6. Practical Implications and Limitations
ToolAlpaca-13B shows that a generic 13B-parameter autoregressive Transformer, fine-tuned on a carefully constructed simulated corpus of just 3,938 examples, can approach GPT-3.5-level proficiency in zero-shot tool use. The paradigm uses large LLMs to synthesize training data efficiently in place of labor-intensive hand annotation.
Limitations:
- Simulated environments may lack real-world uncertainty, latency, and error modes.
- The training corpus is currently restricted to text-based APIs.
- Extending to multi-modal or stateful robotics tools is unaddressed.
- Integrating real API traces or human-in-the-loop curation could further enhance robustness.
A plausible implication is that simulation-based data acquisition, combined with strong backbone models and rigorous prompt and output formats, offers an efficient route to general-purpose tool-using LLMs at modest scale (Tang et al., 2023).