
Tool-DE: Benchmark for LLM Tool Retrieval

Updated 29 October 2025
  • Tool-DE is a benchmark and methodological framework that systematically enriches tool documentation using LLM-driven expansion to address sparsity and inconsistency.
  • It employs a rigorous four-stage pipeline—expansion, judgement, refinement, and human validation—to convert unstructured tool docs into structured, semantically rich profiles.
  • Empirical results demonstrate significant improvements in metrics like N@10 and R@10, positioning Tool-DE as a state-of-the-art solution for LLM tool retrieval.

Tool-DE refers to a benchmark and methodological framework for addressing the challenge of incomplete and heterogeneous documentation in tool retrieval for LLM agents. Tool-DE systematically enriches tool documentation by leveraging LLMs to augment unstructured or sparse tool descriptions with structured, semantically dense profiles. This results in improved retrieval effectiveness and enables the development of specialized retriever and reranker models that outperform prior approaches on large-scale tool retrieval benchmarks (Lu et al., 26 Oct 2025).

1. Motivation and Problem Setting

The central problem addressed by Tool-DE is the poor quality, sparsity, and inconsistency of documentation for tools—including APIs, code functions, and service endpoints—that limits the retrieval performance of both sparse and dense retrieval models in LLM-augmented tool-use pipelines. In large-scale tool ecosystems, approximately 41.6% of original tool docs lack clear functional information or contextual guidance, impeding natural language matching and semantic alignment between user queries and tool profiles.

Tool retrieval, defined as the process of selecting the most suitable tool(s) for a user query, becomes increasingly non-trivial as toolsets grow in size and heterogeneity. Existing benchmarks and retrieval paradigms often attempt to work around documentation deficiency but do not systematically rectify it at the source. Tool-DE operationalizes the hypothesis that standardization and enrichment of tool documentation via LLM-based expansion can substantially improve downstream retrieval and agent efficacy.

2. Document Expansion Pipeline

The document expansion process in Tool-DE is a four-stage pipeline:

  1. Expansion: An open-source LLM (Qwen3-32B, reasoning mode enabled) generates structured profiles for each tool based strictly on its original documentation. New fields are created:
    • function_description: Main function, ≤20 words.
    • tags: 3–5 keywords.
    • Optionally, when_to_use, limitations, example_usage—included when grounding permits.
    • Expansion is strictly doc-grounded; hallucination beyond the source text is not allowed.

Expansion is formalized as:

d_{\mathrm{expansion}} = d_{\mathrm{original}} \cup d_{\mathrm{profile}}

  2. Judgement: Llama-3.1-70B checks generated profiles for faithfulness and adherence to the prescribed JSON format; rule-based checks enforce non-empty canonical fields.
  3. Refinement: Roughly 1.5% of cases failing the judgement stage are re-processed with GPT-4o under identical grounding constraints.
  4. Human Validation: A sample of 100 expansions undergoes human assessment for correctness, completeness, and absence of hallucination; all pass.

The cumulative effect is a robustly enriched documentation corpus, with empirical reduction of incomplete fields from 41.6% (pre-expansion) to 23.5% (post-expansion). Only grounded, verifiable content is retained.
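The first two stages are straightforward to reproduce with any instruction-following LLM. The sketch below is a minimal, hypothetical implementation of the expansion and judgement steps, assuming an OpenAI-compatible endpoint serving the Qwen3-32B and Llama-3.1-70B checkpoints; the endpoint URL, prompt wording, and helper names are illustrative and are not the authors' exact prompts or code.

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint serving local models

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

EXPAND_PROMPT = (
    "Read the tool documentation below and return a JSON object with the fields "
    "'function_description' (<= 20 words) and 'tags' (3-5 keywords), plus "
    "'when_to_use', 'limitations', 'example_usage' only if the documentation supports them. "
    "Use ONLY information stated in the documentation; do not invent details.\n\n"
    "DOCUMENTATION:\n{doc}"
)

JUDGE_PROMPT = (
    "Given the original documentation and a generated profile, answer 'pass' if every "
    "profile field is faithful to the documentation, otherwise 'fail'.\n\n"
    "DOCUMENTATION:\n{doc}\n\nPROFILE:\n{profile}"
)

def expand(doc: str, model: str = "Qwen3-32B") -> dict:
    """Stage 1 (Expansion): generate a structured, doc-grounded profile."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EXPAND_PROMPT.format(doc=doc)}],
    )
    return json.loads(resp.choices[0].message.content)  # assumes the model returns bare JSON

def judge(doc: str, profile: dict, model: str = "Llama-3.1-70B") -> bool:
    """Stage 2 (Judgement): rule-based field check plus LLM faithfulness check."""
    if not profile.get("function_description") or not profile.get("tags"):
        return False  # canonical fields must be non-empty
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(doc=doc, profile=json.dumps(profile))}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("pass")
```

Profiles that fail the judgement step (roughly 1.5% in the paper) would be re-run through the same expansion call against GPT-4o before the final human spot-check.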

3. Structure of Enriched Tool Profiles

Resulting tool profiles are JSON objects with the following core fields:

"tool_profile": {
  "function": "Filters and retrieves upcoming Chinese films with details like title, date, and region",
  "tags": ["film", "upcoming", "china", "filter", "movie details"],
  "when_to_use": "When planning movie outings or researching new releases in China",
  "limitation": "Region and category parameters must be provided in Chinese"
}

The fields function and tags are always present; when_to_use, limitations, and example_usage are included only when this information is present or derivable from the original documentation. Ablation studies indicate that including all fields is not always optimal—function and tags yield the largest retrieval gains, while example_usage can sometimes reduce performance due to noise or dilution of relevant signal.
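How the profile fields are concatenated with the original documentation to form the retrieval text is an implementation detail; the helper below is a minimal sketch of one plausible serialization, with a `fields` argument that mirrors the field-level ablations (e.g., dropping example_usage). Field names follow the JSON example above and are otherwise assumptions.

```python
def serialize_expanded_doc(original_doc: str, profile: dict,
                           fields=("function", "tags", "when_to_use", "limitation")) -> str:
    """Join the original documentation with selected profile fields
    (d_expansion = d_original + d_profile). Field names follow the JSON example above."""
    parts = [original_doc.strip()]
    for field in fields:
        value = profile.get(field)
        if not value:
            continue  # optional fields appear only when grounded in the source doc
        if isinstance(value, list):
            value = ", ".join(value)
        parts.append(f"{field}: {value}")
    return "\n".join(parts)

# Ablating a field is a matter of dropping it from `fields`, e.g.
# serialize_expanded_doc(doc, profile, fields=("function", "tags"))
```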

4. Datasets and Coverage

Tool-DE comprises two principal corpora, derived from the pre-existing ToolRet meta-benchmark (which aggregates 35 tool use datasets across APIs, code, and customized natural language services):

  • Tool-Embed-Train: 50,000 query-tool-relevant pairs for training dense retrievers.
  • Tool-Rank-Train: 200,000 query-tool pairs for LLM-based reranker training.

Domains span:

  • Web APIs (JSON/OpenAPI, general services),
  • Code functions (computational, e.g., libraries),
  • Customized scenario documents (e.g., financial service endpoints, dining bookings).

Coverage analysis demonstrates that LLM-based expansion directly mitigates incompleteness: post-expansion, clear function and scenario fields are present in a much higher fraction of tools.
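For intuition, a record in these corpora can be pictured as a query paired with its relevant tool and a handful of sampled negatives. The layout below is purely hypothetical and only illustrates the shape of a query-tool relevance pair, not the released schema.

```python
# Purely illustrative layout of a single query-tool relevance pair; the released
# schema may differ. Negatives are sampled from the rest of the tool corpus.
train_example = {
    "query": "Find upcoming Chinese films releasing next month",
    "positive_tool_id": "film_api/get_upcoming_cn_films",
    "negative_tool_ids": [
        "film_api/box_office",
        "weather/forecast",
        "finance/stock_quote",
        "dining/book_table",
        "code/sort_list",
    ],
}
```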

5. Dedicated Retrieval and Rerank Models

Tool-DE couples the expanded documentation corpus with two models designed and trained on the enriched data:

Tool-Embed (Retriever)

  • Implementation: Dense retrieval via bi-encoder, using Qwen3-Embedding-0.6B and 4B variants.
  • Training: InfoNCE contrastive objective (one positive, five negatives per query), single epoch, full parameter tuning via DeepSpeed ZeRO-3.
  • Input: Query and expanded tool profile; output: embedding in vector space.
  • Evaluation: NDCG@10, Recall@10, Completeness@10.
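A minimal sketch of the contrastive objective under this setup (one positive, five sampled negatives per query) is given below. It is generic InfoNCE in PyTorch rather than the authors' training code, and the temperature value is an assumption; the embeddings themselves would come from the Qwen3-Embedding encoder.

```python
import torch
import torch.nn.functional as F

def infonce_loss(q_emb: torch.Tensor, pos_emb: torch.Tensor,
                 neg_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with one positive and k negatives per query.

    q_emb:   (B, d)    query embeddings
    pos_emb: (B, d)    embeddings of the relevant (expanded) tool profile
    neg_emb: (B, k, d) embeddings of k sampled negatives (k = 5 in Tool-Embed training)
    """
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)

    pos_scores = (q * pos).sum(-1, keepdim=True)       # (B, 1)
    neg_scores = torch.einsum("bd,bkd->bk", q, neg)    # (B, k)
    logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```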

Tool-Rank (Reranker)

  • Implementation: LLM cross-encoder reranker (Qwen3-Reranker-4B), LoRA adaptation (r = 32, α = 64, dropout 0.1), LLaMA-Factory backend.
  • Objective: Cross-entropy for relevance ({true, false}) per query-tool pair.
  • Scoring:

P(\text{relevant} \mid q, d) = \frac{\exp(\ell_{\mathrm{true}})}{\exp(\ell_{\mathrm{true}}) + \exp(\ell_{\mathrm{false}})}

where $\ell_{\mathrm{true}}$ and $\ell_{\mathrm{false}}$ are the model logits for the true and false tokens.
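This scoring rule can be sketched with Hugging Face transformers as below; the checkpoint name, prompt template, and token-id lookup are assumptions made for illustration, and only the final softmax over the two logits reflects the formula above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name, prompt template, and label-token lookup are illustrative assumptions.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-4B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-4B",
                                             torch_dtype=torch.bfloat16)
model.eval()

def relevance_score(query: str, tool_profile: str) -> float:
    """P(relevant | q, d): softmax over the logits of the 'true' and 'false' tokens."""
    prompt = (f"Query: {query}\nTool: {tool_profile}\n"
              "Is this tool relevant to the query? Answer true or false: ")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits at the final position
    # Take the first sub-token id of each label word (heuristic).
    true_id = tokenizer("true", add_special_tokens=False).input_ids[0]
    false_id = tokenizer("false", add_special_tokens=False).input_ids[0]
    pair = torch.stack([logits[true_id], logits[false_id]])
    return torch.softmax(pair, dim=0)[0].item()  # probability assigned to 'true'
```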

Ablation models are trained identically on unexpanded docs for direct comparison.

6. Experimental Evaluation and Field Ablation

Experiments on Tool-DE and ToolRet demonstrate that document expansion substantially improves performance for both sparse (BM25) and dense (GritLM-7B, Tool-Embed) retrievers, as well as LLM rerankers.
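The effect of expansion on a sparse retriever is easy to probe directly; the snippet below is a small sketch using the rank_bm25 package, where the corpus variables (original vs. expanded documents) are placeholders.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def bm25_topk(corpus_texts, query, k=10):
    """Rank a corpus for one query with BM25; whitespace tokenization keeps the sketch simple."""
    tokenized = [doc.lower().split() for doc in corpus_texts]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Index the same tools twice, once with original docs and once with expanded profiles,
# and compare the retrieved sets for identical queries (corpus variables are placeholders):
# top_original = bm25_topk(original_docs, "find upcoming Chinese films")
# top_expanded = bm25_topk(expanded_docs, "find upcoming Chinese films")
```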

Key Results:

  • Tool-Embed-4B achieves N@10=52.23, R@10=63.13, C@10=51.61, setting new state-of-the-art results.
  • Tool-Rank-4B reranking raises N@10 to 56.44 and improves R@10 by +4.68 and C@10 by +4.99 over first-stage retrieval.
  • Sparse retrieval (BM25) also benefits: N@10 rises from 36.41 (ToolRet) to 39.35 (Tool-DE); dense retrievers see consistent +2–4 point improvements.
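The N@10 and R@10 figures above follow the standard NDCG and Recall definitions with binary relevance labels; a minimal reference implementation is sketched below (C@10, Completeness@10, is omitted since its exact definition is not reproduced here).

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """R@k: fraction of the relevant tools that appear in the top-k ranking."""
    hits = sum(1 for t in ranked_ids[:k] if t in relevant_ids)
    return hits / max(len(relevant_ids), 1)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """N@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, t in enumerate(ranked_ids[:k]) if t in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0
```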

Field-level ablation indicates:

  • Adding function and tags drives most improvement.
  • Including example_usage often dilutes relevance.
  • Removal of noisy or superfluous fields sharpens positive–negative separation during both training and inference.

Expansions also regularize input structure; models trained on expanded docs generalize more robustly to both expanded and non-expanded test docs due to reduced semantic and structural heterogeneity.

7. Broader Impact and Implications

Tool-DE demonstrates that systematic, LLM-powered document expansion is a foundational technique for breaking through longstanding limits in LLM-agent tool retrieval imposed by sparse and inconsistent documentation. By standardizing, enriching, and structurally regularizing tool metadata at scale, Tool-DE enables:

  • Stronger performance for both sparse and neural retrieval architectures.
  • Effective LLM-based cross-encoder reranking.
  • Generalization across diverse tool-using domains (code, web APIs, services).
  • Immediate applicability as a reproducible benchmark, with released code and data (https://github.com/EIT-NLP/Tool-DE).

Document expansion emerges as a generic, scalable, and domain-agnostic precursor to next-generation LLM tool selection and agent orchestration systems. Strategic selection and curation of expansion fields is critical to maximize signal and avoid performance degradation. Future directions include adaptation to multi-lingual and domain-specific tools, dynamic or query-adaptive expansions, and feedback-driven expansion-refinement loops.


| Aspect | Approach | Details |
| --- | --- | --- |
| Goal | Improve tool retrieval via enriched, LLM-generated documentation | Standardized profiles; address poor and inconsistent docs |
| Expansion | Four-stage LLM pipeline (Qwen3-32B → Llama-3.1-70B → GPT-4o → human check) | Factual, non-hallucinatory; fields: function, tags, etc. |
| Datasets | Tool-Embed-Train (50k), Tool-Rank-Train (200k) | Derived from ToolRet; covers web, code, custom domains |
| Models | Tool-Embed (dense retriever, 0.6B/4B); Tool-Rank (LLM reranker, 4B, LoRA) | Trained on expanded docs; ablations on original vs. expanded |
| Key Findings | Expansion always improves N@10, R@10, C@10; function/tags most informative | SOTA results; field ablation critical for maximizing gains |

Tool-DE constitutes a scalable, empirically validated solution for the documentation bottleneck in LLM-agent tool retrieval, providing both a data resource and a methodological foundation for subsequent research into agentic tool selection and orchestration systems (Lu et al., 26 Oct 2025).
