ToolAlpaca-13B: Compact LLM for Generalized Tool Use

Updated 13 March 2026
  • ToolAlpaca-13B is a 13-billion-parameter language model fine-tuned on data from a multi-agent simulation pipeline, reaching near-GPT-3.5 tool-use proficiency.
  • The model is fine-tuned on a diverse, automatically generated API corpus with minimal human annotation, using a ReAct-style framework for tool interactions.
  • Empirical results highlight its high function and argument accuracy on unseen APIs, demonstrating effective generalization with modest training data.

ToolAlpaca-13B is a 13-billion-parameter LLM that demonstrates strong generalized tool-use capabilities via an LLM-driven data-simulation and fine-tuning pipeline. Unlike prior approaches that either require extremely large models (such as GPT-4) for zero-shot tool use or rely on tool-specific supervision in compact models, ToolAlpaca-13B achieves near-GPT-3.5 proficiency in generalized tool learning by leveraging a highly diverse, automatically generated corpus, with minimal human annotation and no architectural changes to the base model. Its results suggest that with sufficient task diversity and corpus engineering, compact autoregressive Transformers can acquire broad tool-use abilities and generalize to previously unseen APIs (Tang et al., 2023).

1. Multi-Agent Corpus Generation via Simulation

ToolAlpaca-13B's foundation is a simulated tool-use corpus, generated with a multi-agent pipeline operated by off-the-shelf LLMs under strict prompt templates. The simulation includes three agent roles:

  • User-Agent: Given a tool's structured documentation (name, introduction, descriptions, function-level docs, OpenAPI spec), drafts diverse instructions (e.g., “List all public holidays in Japan in 2024”) and responds to clarifying queries from the assistant.
  • Assistant-Agent: Follows a ReAct-style pattern, alternating Thought, Action (tool call), and Observation steps until a final Response is produced. Each Action is formatted precisely as
    Action name: <function_name>
    Input: <JSON arguments>
  • Tool-Executor Agent: Receives the assistant's action and the tool's OpenAPI spec, and generates HTTP-style simulated responses, including status codes and JSON payloads.
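
The action format used by the Assistant-Agent lends itself to simple mechanical parsing. The following is a minimal sketch, assuming the `Action name:` and `Input:` lines appear exactly as shown above; the regular expression, the helper `parse_action`, and the function name `getPublicHolidays` are hypothetical and are not taken from the original pipeline.

```python
import json
import re
from typing import Any, Dict, Optional, Tuple

# Hypothetical pattern for the action format; the exact whitespace rules
# used by the original pipeline are an assumption.
ACTION_RE = re.compile(
    r"Action name:\s*(?P<name>\S+)\s*\n\s*Input:\s*(?P<args>\{.*\})",
    re.DOTALL,
)


def parse_action(text: str) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (function_name, arguments), or None if no well-formed action is found."""
    match = ACTION_RE.search(text)
    if match is None:
        return None
    try:
        # Arguments must be valid JSON; malformed input is rejected.
        args = json.loads(match.group("args"))
    except json.JSONDecodeError:
        return None
    return match.group("name"), args


# Illustrative assistant output (function name and arguments are made up).
example = (
    "Thought: I need the public holidays for Japan in 2024.\n"
    "Action name: getPublicHolidays\n"
    'Input: {"country": "JP", "year": 2024}'
)
print(parse_action(example))
```

Returning `None` for malformed output, rather than raising, makes it straightforward to count format failures separately from wrong tool choices.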

Each tool-use instance is collected as a triple:

$$\{ \text{Instruction}, [(\text{Thought}_1, \text{Action}_1, \text{Observation}_1), \ldots, (\text{Thought}_k, \text{Action}_k, \text{Observation}_k)], \text{Final Response} \}$$

The User-Agent and Tool-Executor Agent are both driven by ChatGPT, while the Assistant-Agent runs on GPT-3.5; all three are held to a strict, consistent output format.
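
The assistant and tool-executor loop described in this section can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the two agents are replaced by stub functions instead of LLM calls, the User-Agent is reduced to a fixed instruction string, and the names `Instance`, `simulate`, `stub_assistant`, and `stub_executor` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)


@dataclass
class Instance:
    """One tool-use instance: instruction, intermediate steps, final response."""
    instruction: str
    steps: List[Step] = field(default_factory=list)
    final_response: str = ""


# assistant(instruction, steps) -> (thought, action or None, final response or None)
Assistant = Callable[[str, List[Step]], Tuple[str, Optional[str], Optional[str]]]
# executor(action) -> simulated observation text
Executor = Callable[[str], str]


def simulate(instruction: str, assistant: Assistant, executor: Executor,
             max_steps: int = 5) -> Instance:
    """Alternate assistant and executor turns until a final response appears."""
    inst = Instance(instruction)
    for _ in range(max_steps):
        thought, action, response = assistant(instruction, inst.steps)
        if action is None:  # the assistant chose to answer instead of acting
            inst.final_response = response or ""
            break
        inst.steps.append((thought, action, executor(action)))
    return inst


# Stubs standing in for the LLM-backed agents (assumptions for illustration).
def stub_assistant(instruction, steps):
    if not steps:
        action = 'Action name: getPublicHolidays\nInput: {"country": "JP", "year": 2024}'
        return "I should query the holiday function.", action, None
    return "I have what I need.", None, "Here are the holidays returned by the tool."


def stub_executor(action):
    return 'Status Code: 200. Response: {"holidays": []}'


trace = simulate("List all public holidays in Japan in 2024",
                 stub_assistant, stub_executor)
print(len(trace.steps), trace.final_response)
```

The `max_steps` cap is a design choice of this sketch: it guards against an assistant that never emits a final response.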

2. Scale, Diversity, and Characteristics of the Tool-Use Corpus

The corpus construction process samples 500 APIs from the "public-apis" GitHub repository (approximately 1,400 APIs across 50+ categories) and keeps only those that are valid, text-based, and parseable. The final dataset comprises 426 unique tools, each assigned ten user instructions generated by the User-Agent. Tool-use traces are collected through the assistant–tool-executor loop, yielding a total corpus of 3,938 instances with the following characteristics:

| Statistic | Value |
|---|---|
| Number of Tool Categories | 50 |
| Number of Unique Tools | 426 |
| Total Instances | 3,938 |
| Single-function-call Instances | 2,512 |
| Multi-function-call Instances | 1,426 |
| Avg. Functions per Tool | 4.85 |
| Avg. Steps per Instance | 1.66 |
| Avg. Instruction Length (tokens) | 23.4 |
| Avg. Output Length (tokens) | 36.2 |
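
A few lines of arithmetic confirm that these statistics are internally consistent. The sketch below uses only the figures reported above; the retention rate at the end assumes that exactly ten instructions were generated for each of the 426 tools, which the source states only loosely.

```python
# Figures taken from the corpus statistics above.
single_call = 2512
multi_call = 1426
tools = 426
instructions_per_tool = 10  # assumption: exactly ten per tool

total = single_call + multi_call
multi_share = multi_call / total

# Upper bound on instances if every generated instruction had been kept.
generated = tools * instructions_per_tool
retained = total / generated

print(total)                 # 3938
print(f"{multi_share:.1%}")  # 36.2%
print(generated)             # 4260
print(f"{retained:.1%}")     # 92.4%
```

Under that assumption, roughly one in thirteen generated instructions did not end up in the final corpus.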

Although modest in size, this highly diversified corpus covers a broad spectrum of real-world tool APIs, including Calendar, Transportation, Weather, Finance, Entertainment, and Multi-Modal categories.

3. Fine-Tuning Procedure and Model Architecture

ToolAlpaca-13B employs Vicuna-13B (itself fine-tuned from LLaMA-13B) as its base, retaining the standard 13-billion-parameter autoregressive Transformer architecture with no structural changes.

Fine-Tuning Regimen:

  • Data: Complete ToolAlpaca corpus (3,938 instances)
  • Hyperparameters:
    • Optimizer: AdamW
    • Learning rate: $2\times10^{-5}$ (linear warmup for 3% of steps, cosine decay)
    • Batch size: 128
    • Weight decay: 0.0
    • Epochs: 3
    • Max sequence length: 2048 tokens
  • Training Objective: Standard next-token log-likelihood:

$$L = -\sum_{t=1}^{n} \log P(y_t \mid y_{<t}; \theta)$$

where $y_{1..n}$ is the concatenated token sequence of user and assistant turns.

  • No auxiliary losses or value heads are used.
  • Prompt format enforces clear delineation of roles and ReAct-style tool calls during supervision.
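
The next-token objective above can be illustrated with a small worked example. This is a minimal sketch in plain Python, assuming the model's probability for each gold token is already available; the helper `next_token_nll` and the toy probabilities are hypothetical and do not come from the source.

```python
import math
from typing import Sequence


def next_token_nll(token_probs: Sequence[float]) -> float:
    """L = -sum_t log P(y_t | y_<t), given P for each gold token y_t."""
    return -sum(math.log(p) for p in token_probs)


# Toy sequence of three tokens with made-up model probabilities.
probs = [0.5, 0.25, 0.125]
loss = next_token_nll(probs)

# -ln(1/2) - ln(1/4) - ln(1/8) = (1 + 2 + 3) * ln 2 = 6 ln 2
print(loss, 6 * math.log(2))
```

In practice, frameworks usually report this quantity averaged over tokens, which here would be `loss / len(probs)`.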

At inference, ToolAlpaca-13B is prompted identically and expected to emit tool calls and responses in the same ReAct-based format as observed during fine-tuning.
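
The learning-rate schedule listed in the fine-tuning regimen (linear warmup for 3% of steps, then cosine decay) can be sketched as below. Several details are assumptions rather than facts from the source: that the last partial batch is kept, that there is no gradient accumulation beyond the stated batch size, and that the cosine decays to zero. The function `lr_at` is hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5,
          warmup_ratio: float = 0.03) -> float:
    """Learning rate at a 0-indexed optimizer step: linear warmup, then cosine decay."""
    warmup_steps = max(1, math.ceil(warmup_ratio * total_steps))
    if step < warmup_steps:
        # Ramp linearly so the peak is reached on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward zero over the remaining steps (assumption).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Step counts implied by the stated corpus size, batch size, and epochs.
steps_per_epoch = math.ceil(3938 / 128)  # 31
total_steps = steps_per_epoch * 3        # 93

for step in (0, 2, 46, 92):
    print(step, lr_at(step, total_steps))
```

With only 93 optimizer steps in total, the warmup phase covers about three steps, so nearly all of training happens on the decaying part of the curve.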

4. Zero-Shot Evaluation on Unseen APIs

ToolAlpaca-13B’s generalization is benchmarked on subsets of APIs and external toolsets not seen during training:

  • Simulated Tools Subset: 10 APIs, 100 instances (generated and manually verified)
  • Real-World APIs Subset: 11 external APIs (e.g., Nager.Date, airportsapi, weatherstack), 114 instances
  • GPT4Tools Set: 8 multi-modal tools from GPT4Tools benchmark, 652 instances (after filtering)

Evaluation Metrics:

  • Procedure Accuracy ($SR_{act}$): Correct selection of function(s) and parameters.
  • Argument Accuracy ($SR_{args}$): Well-formed JSON arguments (types, values).
  • Overall Success Rate (SR): Both above are correct and the final answer satisfies the user.
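
The relationship between the three metrics can be made concrete with a short sketch. It assumes that a judge has already produced three boolean verdicts per instance; the `Judgment` record, the `success_rates` helper, and the toy verdicts are hypothetical and do not reproduce the actual evaluation script.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Judgment:
    """Per-instance verdicts, e.g. as produced by an LLM or human judge."""
    procedure_ok: bool   # right function(s) and parameters selected
    arguments_ok: bool   # well-formed JSON arguments
    answer_ok: bool      # final answer satisfies the user


def success_rates(judgments: Iterable[Judgment]) -> Dict[str, float]:
    items = list(judgments)
    if not items:
        raise ValueError("no judgments to aggregate")
    n = len(items)
    return {
        "SR_act": sum(j.procedure_ok for j in items) / n,
        "SR_args": sum(j.arguments_ok for j in items) / n,
        # Overall success requires all three verdicts to hold at once.
        "SR": sum(j.procedure_ok and j.arguments_ok and j.answer_ok
                  for j in items) / n,
    }


toy = [
    Judgment(True, True, True),
    Judgment(True, False, False),
    Judgment(False, True, False),
    Judgment(True, True, False),
]
print(success_rates(toy))
```

Because the overall rate is a conjunction, it can never exceed either component rate.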

Machine evaluation is conducted using GPT-4 as a judge; for the simulated subset, a human acceptance rate is also collected.

Quantitative Results

| Model | Simulated SR | Simulated Human | Real-World SR |
|---|---|---|---|
| GPT-3.5 | 75.0% | 79.0% | 72.8% |
| Vicuna-13B (zero-shot) | 16.0% | 25.0% | 12.3% |
| ToolAlpaca-13B | 70.0% | 75.0% | 61.4% |

For the GPT4Tools subset:

| Model | $SR_t$ | $SR_{act}$ | $SR_{args}$ | SR |
|---|---|---|---|---|
| GPT-3.5 | 99.5% | 99.5% | 91.5% | 91.5% |
| Vicuna-13B (zero-shot) | 84.4% | 43.7% | 46.7% | 26.2% |
| GPT4Tools (71K cases) | 98.2% | 97.0% | 92.2% | 90.6% |
| ToolAlpaca-13B (3.9K) | 95.5% | 85.3% | | 83.7% |

With only 3,938 simulated examples, ToolAlpaca-13B approaches the GPT-3.5 baseline and achieves 83.7% overall success on the out-of-dataset GPT4Tools benchmark, within about seven points of the GPT4Tools model, which was trained on roughly 71K in-domain cases.
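
The gaps implied by the reported numbers can be computed directly. The sketch below uses only figures from the tables in this section; treating "71K" as exactly 71,000 training cases is an approximation.

```python
# Overall success rates (%) as reported in the first results table.
sr = {
    "simulated": {"GPT-3.5": 75.0, "Vicuna-13B": 16.0, "ToolAlpaca-13B": 70.0},
    "real_world": {"GPT-3.5": 72.8, "Vicuna-13B": 12.3, "ToolAlpaca-13B": 61.4},
}

gaps = {}
for split, row in sr.items():
    gaps[split] = {
        # Points still separating ToolAlpaca-13B from GPT-3.5.
        "behind_gpt35": round(row["GPT-3.5"] - row["ToolAlpaca-13B"], 1),
        # Points gained over the untuned Vicuna-13B base.
        "gain_over_base": round(row["ToolAlpaca-13B"] - row["Vicuna-13B"], 1),
    }
    print(split, gaps[split])

# GPT4Tools benchmark: overall SR gap and training-set size ratio.
gpt4tools_gap = round(90.6 - 83.7, 1)
data_ratio = round(71_000 / 3_938, 1)  # 71K is approximate
print(gpt4tools_gap, data_ratio)
```

The gap to GPT-3.5 is noticeably wider on real-world APIs than on simulated ones, while the training-set ratio relative to GPT4Tools is roughly eighteen to one.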

5. Empirical Analyses: Ablations and Insights

Tool Diversity

Controlling for instance count (3,938), varying the number of distinct tools during training yields:

  • 10 tools → 51% overall accuracy
  • 40 tools → ≈58%
  • 100 tools → ≈64%
  • 400 tools → 70%

Diversity in APIs is thus the primary driver of generalization, with model scale contributing secondarily.
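
One way to set up such a diversity ablation is to restrict training data to a fixed number of tools while holding the instance count constant. The sketch below is an assumption about how this could be done, not the procedure used for the reported numbers: when the selected tools do not provide enough instances it samples with replacement, whereas the original study may instead have generated additional instructions per tool. The helper `sample_with_tool_budget` and the toy corpus are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (tool_name, serialized instance)


def sample_with_tool_budget(corpus: List[Example], num_tools: int,
                            num_instances: int, seed: int = 0) -> List[Example]:
    """Draw num_instances examples that use at most num_tools distinct tools."""
    rng = random.Random(seed)
    by_tool: Dict[str, List[str]] = defaultdict(list)
    for tool, payload in corpus:
        by_tool[tool].append(payload)

    # Choose which tools are allowed in this ablation arm.
    chosen = rng.sample(sorted(by_tool), num_tools)
    pool = [(tool, payload) for tool in chosen for payload in by_tool[tool]]

    if len(pool) >= num_instances:
        return rng.sample(pool, num_instances)
    # Assumption: pad by sampling with replacement when the pool is too small.
    return [rng.choice(pool) for _ in range(num_instances)]


# Toy corpus: 20 tools with 5 instances each.
toy_corpus = [(f"tool_{i}", f"instance_{i}_{j}")
              for i in range(20) for j in range(5)]
subset = sample_with_tool_budget(toy_corpus, num_tools=4, num_instances=50)
print(len(subset), len({tool for tool, _ in subset}))
```

Fixing the seed keeps the chosen tool set reproducible across ablation arms.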

Model Scale

Comparing 7B and 13B parameter variants of ToolAlpaca under identical training, the 13B model consistently outperforms the 7B by 5–10 points on simulated and real-world benchmarks. Model capacity is beneficial only when coupled with sufficient tool diversity.

Data Quality

Human review (100 instances) reports:

  • Instruction solvability: 88%
  • Tool executor response correctness: 92%
  • Assistant action and final response accuracy: 80%

This validates the multi-agent simulation pipeline’s ability to produce high-quality, realistic tool-use traces.

6. Practical Implications and Limitations

ToolAlpaca-13B shows that a 13B-parameter generic autoregressive Transformer, when fine-tuned on a carefully constructed simulated corpus, can approach GPT-3.5-level proficiency in zero-shot tool use with just 3,938 examples. The paradigm uses stronger LLMs to synthesize training data efficiently, in place of labor-intensive hand annotation.

Limitations:

  • Simulated environments may lack real-world uncertainty, latency, and error modes.
  • The model is currently restricted to text-based APIs.
  • Extending to multi-modal or stateful robotics tools is unaddressed.
  • Real API traces and human-in-the-loop curation are not yet integrated, although either could further enhance robustness.

A plausible implication is that simulation-based data acquisition, when combined with strong backbone models and rigorous prompt formats, offers an efficient route to general-purpose tool-using LLMs at modest scale (Tang et al., 2023).
