
CompSkillBench: Benchmark for Compositional Skill Routing

Updated 4 July 2026
  • CompSkillBench is a benchmark for compositional skill routing in LLM agents, requiring query decomposition, skill retrieval, and plan composition over a large pool of real MCP skills.
  • It employs detailed metrics such as Decomposition Accuracy, Skill Recall@k, and Category Recall to diagnose granularity errors and retrieval performance.
  • The benchmark facilitates realistic evaluation of multi-step, dependency-aware workflows, advancing tool-using LLM agent research through controlled, compositional task assessment.

CompSkillBench is a benchmark for compositional skill routing in tool-using LLM agents, introduced with the SkillWeaver framework in "Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose" (Gao, 16 Jun 2026). It formalizes the setting in which an agent receives a complex user query and must operate over a large library of real, reusable skills by decomposing the query into atomic sub-tasks, retrieving an appropriate skill for each sub-task, and composing an executable plan. The benchmark contains 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, all sourced from the public MCP ecosystem, and is designed to isolate decomposition granularity, step-wise retrieval quality, and multi-step workflow structure in a large-skill-pool regime.

1. Problem formulation and motivation

CompSkillBench was created to evaluate LLM agents on compositional skill routing: given a complex query and a large library of real skills, the agent must perform three operations: decompose the task into atomic sub-tasks, retrieve appropriate skills per sub-task, and compose an executable plan (Gao, 16 Jun 2026). This formulation departs from prior tool-use evaluations that largely assume single-tool selection or small, fixed tool inventories. The benchmark therefore targets a setting closer to contemporary agent ecosystems, where thousands of community-maintained skills coexist and realistic tasks require multi-skill compositions rather than flat tool invocation.

The benchmark addresses two gaps. The first is scale and realism: it uses thousands of real MCP server skills with authentic metadata and taxonomy rather than synthetic tool descriptions or tiny curated toolsets. The second is compositionality and structure: it evaluates step-wise retrieval and decomposition granularity for multi-step, dependency-aware workflows. A central design premise is that retrieval errors in such systems are often downstream consequences of incorrect decomposition granularity. The benchmark is therefore not limited to end-to-end success rates; it includes dedicated metrics for diagnosing the cascading bottlenecks in decompose–retrieve–compose pipelines.

This emphasis changes what is being measured. Instead of asking only whether an LLM can invoke a tool, CompSkillBench asks whether it can align a natural-language query with the latent structure of a large skill library. The benchmark thereby treats task decomposition as a first-class evaluand rather than an unobserved intermediate artifact.

2. Dataset construction and taxonomy

The skill corpus in CompSkillBench consists of 2,209 unique skills sourced from the public MCP ecosystem, specifically the awesome-mcp-servers registry (Gao, 16 Jun 2026). The curation pipeline is explicitly specified:

  1. Extraction: 2,228 server entries were parsed, including name, description, category, and repository URL.
  2. Quality filtering: entries with descriptions shorter than 15 characters or consisting mostly of badges were removed, reducing the corpus to 2,213.
  3. Deduplication: identical normalized names were merged, yielding 2,209 unique skills.
  4. Categorization: 49 original tags were mapped to 24 canonical categories via a curated mapping.
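The filtering, deduplication, and categorization steps above can be sketched as follows. This is a minimal illustration, not the paper's code: the entry fields, the sample entries, the name-normalization rule, and the `TAG_TO_CATEGORY` slice are all assumptions, and the badge-based filter is omitted.

```python
import re

# Hypothetical raw registry entries; field names and values are illustrative only.
RAW_ENTRIES = [
    {"name": "Postgres-MCP", "description": "Query and inspect PostgreSQL databases.", "tag": "Databases"},
    {"name": "postgres mcp", "description": "Query and inspect PostgreSQL databases.", "tag": "Databases"},
    {"name": "tiny", "description": "Too short", "tag": "Other"},
    {"name": "slack-mcp", "description": "Send and read messages in Slack channels.", "tag": "Chat"},
]

# Hypothetical slice of the curated mapping (the text describes 49 tags -> 24 categories).
TAG_TO_CATEGORY = {"Databases": "databases", "Chat": "communication"}

MIN_DESCRIPTION_CHARS = 15  # threshold stated in the text


def normalize_name(name):
    """Lowercase and collapse non-alphanumeric runs (an assumed normalization rule)."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def curate(entries, tag_map):
    # Quality filter: drop entries whose description is shorter than the threshold.
    kept = [e for e in entries if len(e["description"].strip()) >= MIN_DESCRIPTION_CHARS]
    # Deduplication: keep the first entry seen for each normalized name.
    unique = {}
    for entry in kept:
        unique.setdefault(normalize_name(entry["name"]), entry)
    # Categorization: map each original tag to a canonical category.
    return [
        {"id": key, "description": e["description"], "category": tag_map.get(e["tag"], "uncategorized")}
        for key, e in unique.items()
    ]


skills = curate(RAW_ENTRIES, TAG_TO_CATEGORY)
print([(s["id"], s["category"]) for s in skills])
```

On the toy input, the short-description entry is dropped and the two Postgres entries collapse into one skill.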

Only categories with at least 5 skills are used in queries, spanning 23 categories in practice. The benchmark taxonomy comprises the following 24 functional categories: developer-tools, finance, integrations, knowledge-management, search-extraction, security, communication, databases, cloud-infrastructure, code-execution, productivity, gaming-entertainment, data-processing, location-services, browser-automation, marketing-analytics, monitoring-observability, ai-ml, multimedia, science-research, file-management, e-commerce, legal-compliance, and data-visualization.

Top categories by skill count include developer-tools (357), finance (270), integrations (229), knowledge-management (180), search-extraction (140), security (122), communication (109), databases (104), cloud-infrastructure (87), and code-execution (69). Representative skills listed in the benchmark description include firecrawl and serper-mcp for search-extraction, slack-mcp and email-server for communication, postgres-mcp and redis-server for databases, and aws-mcp and terraform-server for cloud-infrastructure.

The taxonomy is not merely organizational metadata. It is also the basis for the benchmark’s category-aware retrieval metrics, particularly Category Recall@k, which acknowledges that multiple skills within the same category may be functionally interchangeable for a given step.

3. Query construction and compositional structure

CompSkillBench contains 300 synthetic queries designed to require multi-skill compositions across categories (Gao, 16 Jun 2026). The difficulty partition is fixed by construction.

| Difficulty | Queries | Typical complexity |
|---|---|---|
| Easy | 150 | 2 skills from 2 categories |
| Medium | 100 | 3 skills from 3 categories |
| Hard | 50 | 4–5 skills from 4–5 categories |

The queries are template-driven, built by combining verb phrases across categories, such as “query the database” and “send a notification.” Ground-truth step descriptions are intentionally written so as not to copy skill names, which forces semantic retrieval rather than lexical matching. Each query is annotated with ground-truth sub-task descriptions, skill IDs, required categories, and an execution order.
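One way to picture the per-query annotation is as a small record type. The sketch below is an assumption about layout, not the benchmark's actual schema: the class name, field names, and the example pairing of skills are hypothetical, and the difficulty rule simply mirrors the skill counts given in the difficulty partition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedQuery:
    """Hypothetical container for one benchmark query and its ground truth."""
    text: str
    sub_tasks: List[str]        # ground-truth step descriptions (no skill names)
    skill_ids: List[str]        # one ground-truth skill per step
    categories: List[str]       # required category per step
    execution_order: List[int]  # position of each step in the chain

    def difficulty(self):
        # Skill counts per difficulty level as stated in the text: 2, 3, and 4-5.
        n = len(self.skill_ids)
        if n == 2:
            return "easy"
        if n == 3:
            return "medium"
        if n in (4, 5):
            return "hard"
        raise ValueError(f"unexpected number of skills: {n}")


# Illustrative item only; not taken from the benchmark.
example = AnnotatedQuery(
    text="Query the database and send a notification.",
    sub_tasks=["query the database", "send a notification"],
    skill_ids=["postgres-mcp", "slack-mcp"],
    categories=["databases", "communication"],
    execution_order=[0, 1],
)
print(example.difficulty())
```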

The dominant execution pattern is a sequential chain, but the framework associated with the benchmark can also detect parallelism and produce a dependency-aware DAG when dependencies allow. An illustrative example is the query: “Download the dataset, transform it, and create visual reports.” Its ground-truth decomposition is a three-step chain: download dataset, transform data, create reports. The candidate categories for these steps include search-extraction or file-management for acquisition, data-processing for transformation, and data-visualization or marketing-analytics for reporting. The associated dependency structure is linear: the download feeds transformation, and transformation feeds visualization.
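The linear dependency structure of this example, and the way parallelism would show up in a DAG, can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is a sketch of the general idea rather than SkillWeaver's dependency detector; the `execution_layers` helper and the second, parallel workflow are hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (the example from the text).
chain = {
    "download dataset": set(),
    "transform data": {"download dataset"},
    "create reports": {"transform data"},
}

# Hypothetical variant with two independent acquisition steps.
parallel = {
    "download sales data": set(),
    "download traffic data": set(),
    "create reports": {"download sales data", "download traffic data"},
}


def execution_layers(dependencies):
    """Group steps into layers; steps in one layer can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a deterministic order
        layers.append(ready)
        sorter.done(*ready)
    return layers


chain_layers = execution_layers(chain)
parallel_layers = execution_layers(parallel)
print(chain_layers)
print(parallel_layers)
```

A sequential chain yields one step per layer, while the parallel variant places both downloads in the first layer.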

Although the queries are synthetic, they are designed to operationalize a specific evaluation target: semantic alignment between decomposed natural-language steps and real skills in a large, noisy ecosystem. The synthetic construction is therefore a control mechanism for benchmarking, not an attempt to imitate raw user logs.

4. Evaluation protocol and metrics

CompSkillBench evaluates systems at three granularities: step-level retrieval, chain-level composition, and decomposition granularity (Gao, 16 Jun 2026). Let $Q$ be the number of queries. Query $q$ has $K_q$ ground-truth steps, and the total number of steps is

$$N = \sum_q K_q.$$

The step-level retrieval metrics are defined as follows. Skill Recall@k measures whether the exact ground-truth skill is present in the top-$k$ retrieved candidates for a step:

$$R@k = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\text{ground-truth skill for step } i \in \text{top-}k \text{ candidates}].$$

Category Recall@k is a relaxed metric that checks whether any candidate from the correct category appears in the top-$k$:

$$CatR@k = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\text{ground-truth category for step } i \in \text{categories(top-}k \text{ candidates for step } i)].$$

This category-based relaxation is motivated by the fact that many skills in a category are functionally interchangeable.
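A minimal sketch of how the two step-level metrics could be computed is shown below. The record layout (`gold_skill`, `gold_category`, `ranked_skills`) and the toy rankings are assumptions for illustration; only the skill names and categories come from the benchmark description.

```python
# Category of each skill, using representative skills named in the text.
SKILL_CATEGORY = {
    "postgres-mcp": "databases",
    "redis-server": "databases",
    "slack-mcp": "communication",
    "email-server": "communication",
    "firecrawl": "search-extraction",
    "serper-mcp": "search-extraction",
}

# Hypothetical retrieval results: one record per ground-truth step.
STEPS = [
    {"gold_skill": "postgres-mcp", "gold_category": "databases",
     "ranked_skills": ["redis-server", "postgres-mcp", "slack-mcp"]},
    {"gold_skill": "slack-mcp", "gold_category": "communication",
     "ranked_skills": ["slack-mcp", "email-server", "firecrawl"]},
    {"gold_skill": "firecrawl", "gold_category": "search-extraction",
     "ranked_skills": ["postgres-mcp", "email-server", "serper-mcp"]},
]


def skill_recall_at_k(steps, k):
    """Fraction of steps whose exact gold skill is among the top-k candidates."""
    hits = sum(1 for s in steps if s["gold_skill"] in s["ranked_skills"][:k])
    return hits / len(steps)


def category_recall_at_k(steps, k, skill_category):
    """Fraction of steps where some top-k candidate has the gold category."""
    hits = 0
    for s in steps:
        top_categories = {skill_category[skill] for skill in s["ranked_skills"][:k]}
        hits += s["gold_category"] in top_categories
    return hits / len(steps)


r_at_1 = skill_recall_at_k(STEPS, 1)
cat_r_at_1 = category_recall_at_k(STEPS, 1, SKILL_CATEGORY)
print(r_at_1, cat_r_at_1)
```

In the toy data the first step retrieves `redis-server` at rank 1, which misses the exact skill but still counts for Category Recall@1 because it shares the gold category.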

The chain-level metrics are Chain Exact Match, the fraction of queries for which every step’s selected skill exactly matches the ground truth,

$$ChainExact = \frac{1}{Q} \sum_q \mathbf{1}[\text{all steps in } q \text{ select the exact ground-truth skill}],$$

and Chain Category Match, reported as Chain, the average fraction of steps per query whose selected skill lies in the ground-truth category:

$$Chain = \frac{1}{Q} \sum_q \frac{1}{K_q} \sum_j \mathbf{1}[\text{selected skill category for } (q,j) = \text{ground-truth category}].$$
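The two chain-level metrics can be sketched as below. The sketch assumes that selected and ground-truth steps are already aligned one-to-one; how step-count mismatches are handled is not described in the text, so that case is left out. The data layout and toy selections are hypothetical.

```python
SKILL_CATEGORY = {
    "postgres-mcp": "databases",
    "redis-server": "databases",
    "slack-mcp": "communication",
    "email-server": "communication",
    "firecrawl": "search-extraction",
    "aws-mcp": "cloud-infrastructure",
}

# Hypothetical per-query selections, aligned step by step with the gold chain.
QUERIES = [
    {"gold": ["postgres-mcp", "slack-mcp"], "selected": ["redis-server", "slack-mcp"]},
    {"gold": ["firecrawl", "postgres-mcp", "email-server"],
     "selected": ["firecrawl", "postgres-mcp", "email-server"]},
    {"gold": ["aws-mcp", "slack-mcp"], "selected": ["postgres-mcp", "email-server"]},
]


def chain_exact_match(queries):
    """Fraction of queries where every selected skill equals the gold skill."""
    exact = sum(1 for q in queries if q["selected"] == q["gold"])
    return exact / len(queries)


def chain_category_match(queries, skill_category):
    """Mean over queries of the per-query fraction of category-correct steps."""
    total = 0.0
    for q in queries:
        matches = sum(
            skill_category[sel] == skill_category[gold]
            for sel, gold in zip(q["selected"], q["gold"])
        )
        total += matches / len(q["gold"])
    return total / len(queries)


chain_exact = chain_exact_match(QUERIES)
chain_cat = chain_category_match(QUERIES, SKILL_CATEGORY)
print(chain_exact, chain_cat)
```

The first toy query fails exact match but scores fully on category match, which is the gap the relaxed metric is meant to expose.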

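For decomposition, the benchmark defines Decomposition Accuracy (DA) as strict step-count accuracy. Writing $\hat{K}_q$ for the number of steps predicted for query $q$:

$$DA = \frac{1}{Q} \sum_q \mathbf{1}[\hat{K}_q = K_q].$$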

It also defines a relaxed variant,

[…]

A crucial clarification is that, in CompSkillBench, DA denotes exact step-count agreement only. It does not score dependency edges, and ordering and dependency evaluation are not part of DA. This is one of the benchmark’s most important interpretive constraints.
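Because DA compares step counts only, it can be computed from two lists of integers. In the sketch below, the `tolerance` parameter is an assumption used to illustrate one possible relaxation; it is not taken from the paper, and the counts are toy values.

```python
def decomposition_accuracy(predicted_counts, gold_counts, tolerance=0):
    """Fraction of queries whose predicted step count matches the gold count.

    tolerance=0 gives strict step-count accuracy. A positive tolerance is a
    hypothetical relaxation that accepts counts within that distance.
    """
    if len(predicted_counts) != len(gold_counts):
        raise ValueError("one predicted count is required per query")
    hits = sum(
        abs(pred - gold) <= tolerance
        for pred, gold in zip(predicted_counts, gold_counts)
    )
    return hits / len(gold_counts)


# Toy step counts for four queries.
predicted = [2, 3, 5, 2]
gold = [2, 3, 4, 4]

strict_da = decomposition_accuracy(predicted, gold)
relaxed_da = decomposition_accuracy(predicted, gold, tolerance=1)
print(strict_da, relaxed_da)
```

Note that a decomposition with the right number of steps in the wrong order would still count as correct here, consistent with the constraint that DA does not score ordering or dependency edges.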

The paper also reports a pilot execution study using mock executors. In a 30-query pilot with an average of 2.80 predicted steps, Step Execution Success (SES) is 86.9% (73/84 steps) and Chain Completion Rate (CCR) is 76.7% (23/30 chains). These execution results are supplementary: the main evaluation target remains decomposition and retrieval.
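The pilot figures are internally consistent, as a quick check shows. The counts below are the ones quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the text for the 30-query mock-executor pilot.
steps_succeeded, steps_total = 73, 84
chains_completed, chains_total = 23, 30

ses = steps_succeeded / steps_total    # Step Execution Success
ccr = chains_completed / chains_total  # Chain Completion Rate
avg_steps = steps_total / chains_total # average predicted steps per query

print(f"SES = {ses:.1%}, CCR = {ccr:.1%}, avg steps = {avg_steps:.2f}")
```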

5. SkillWeaver and empirical findings on CompSkillBench

The benchmark is paired with SkillWeaver, a decompose–retrieve–compose framework used to probe performance on CompSkillBench (Gao, 16 Jun 2026). In the decompose stage, an instruction-tuned LLM, Qwen2.5-7B-Instruct, produces an ordered list of atomic sub-tasks. The paper’s main intervention is Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop in which the names and descriptions of the top-ranked retrieved skills are fed back as hints to refine decomposition granularity; the default is one iteration, with the number of hints set to […].
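The SAD feedback loop can be sketched as a function over two pluggable components. Everything named here is hypothetical: `decompose` and `retrieve` stand in for the LLM and the retriever, `num_hints` and `per_step_k` are left as required arguments because no values are assumed, and ranking pooled candidates by their best score is an assumed selection rule.

```python
def skill_aware_decompose(query, decompose, retrieve, num_hints, per_step_k, iterations=1):
    """Decompose once without hints, then refine using retrieved skills as hints."""
    steps = decompose(query, None)  # vanilla first pass
    for _ in range(iterations):
        # Pool candidates across all steps, keeping each skill's best score.
        best = {}
        for step in steps:
            for skill, score in retrieve(step, per_step_k):
                if score > best.get(skill, float("-inf")):
                    best[skill] = score
        # Keep the highest-scoring skills as hints (ties broken by name).
        hints = sorted(best, key=lambda s: (-best[s], s))[:num_hints]
        steps = decompose(query, hints)
    return steps


# Toy stand-ins for the LLM and the retriever, for illustration only.
def toy_decompose(query, hints):
    if not hints:
        return ["download and transform the dataset", "create visual reports"]
    return ["download dataset", "transform data", "create reports"]


def toy_retrieve(step, k):
    catalog = [(("firecrawl", "crawl and extract web content"), 0.6),
               (("postgres-mcp", "query a PostgreSQL database"), 0.3)]
    return catalog[:k]


steps = skill_aware_decompose(
    "Download the dataset, transform it, and create visual reports.",
    toy_decompose, toy_retrieve, num_hints=2, per_step_k=2,  # arbitrary toy values
)
print(steps)
```

The toy decomposer merges two sub-tasks on the first pass and splits them once hints are present, which mimics the granularity correction the loop is meant to produce.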

In the retrieve stage, a bi-encoder retriever based on all-MiniLM-L6-v2 encodes sub-task text and skill metadata into 384-dimensional embeddings. The vectors are $\ell_2$-normalized and indexed with FAISS IndexFlatIP for exact inner-product top-$k$ search, with $k = 10$ by default. Two skill representations are considered: metadata-only, using name and description, and body-aware, which can optionally include up to 2,000 characters of the specification body.
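The retrieval logic (normalize, then rank by inner product) can be illustrated without external dependencies. The sketch below deliberately swaps in a toy bag-of-words embedding for all-MiniLM-L6-v2 and a brute-force scan for FAISS IndexFlatIP, so it shows the mechanics only; the skill descriptions and helper names are hypothetical.

```python
import math

# Hypothetical metadata-only skill representations (name plus description).
SKILLS = {
    "postgres-mcp": "query a postgres database with sql",
    "slack-mcp": "send a message or notification to a slack channel",
    "firecrawl": "crawl and extract content from web pages",
}


def tokenize(text):
    return text.lower().replace("-", " ").split()


def build_vocab(texts):
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(text, vocab):
    """Toy count vector, L2-normalized so inner product equals cosine similarity."""
    vec = [0.0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def top_k(query, index, vocab, k):
    """Exact inner-product search by scanning every indexed vector."""
    q = embed(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, vec)), sid) for sid, vec in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sid for _, sid in scored[:k]]


texts = {sid: f"{sid} {desc}" for sid, desc in SKILLS.items()}
vocab = build_vocab(texts.values())
index = {sid: embed(text, vocab) for sid, text in texts.items()}

results = top_k("send a notification", index, vocab, k=2)
print(results)
```

With normalized vectors, inner-product ranking is equivalent to cosine ranking, which is why an inner-product index suffices after normalization.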

In the compose stage, SkillWeaver selects a skill per step by combining similarity with inter-step compatibility and detects dependencies to form a DAG. The empirical emphasis, however, is on the earlier stages: the paper identifies decomposition quality as the primary bottleneck.

Under the default setting (Qwen2.5-7B decomposition, MiniLM retrieval, $k = 10$, and the full 2,209-skill library), vanilla decomposition without hints yields […], […], […], […], and […]. With SAD in one iteration, the results become […], […], […], […], and […]. The gain in DA is reported as […] relative, with a Wilcoxon signed-rank test […]; the appendix reports […]. A bootstrap 95% confidence interval for […] is […].

The most important analytical result is the DA-conditioned analysis. Conditioning on queries where $DA = 1$, […] rises from 34.2% to 41.2%, and […] rises to 81.6%. The paper interprets this as evidence that correct granularity is a prerequisite for effective retrieval. The difficulty breakdown follows the same pattern: SAD improves DA for Easy queries from 44.7% to 63.3%, for Medium from 66.0% to 78.0%, and for Hard from 40.0% to 60.0%, while CatR@1 gains remain comparatively modest.
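The per-difficulty DA figures can be pooled into a single rate by weighting each bucket by its query count. The pooled numbers computed below are derived here for illustration and are not quoted from the paper; they assume that each per-difficulty rate is measured over all queries in its bucket (150, 100, and 50).

```python
# (bucket size, DA without hints, DA with SAD), as quoted in the text.
BUCKETS = {
    "easy": (150, 0.447, 0.633),
    "medium": (100, 0.660, 0.780),
    "hard": (50, 0.400, 0.600),
}


def pooled_rate(buckets, column):
    """Query-weighted average of one rate column (0 = vanilla, 1 = SAD)."""
    total = sum(size for size, *_ in buckets.values())
    correct = sum(size * rates[column] for size, *rates in buckets.values())
    return correct / total


pooled_vanilla = pooled_rate(BUCKETS, 0)
pooled_sad = pooled_rate(BUCKETS, 1)
print(f"pooled DA: {pooled_vanilla:.3f} -> {pooled_sad:.3f}")
```

Under that assumption, the Easy bucket dominates the pooled rate because it holds half of the queries.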

Further observations reinforce the same diagnosis. Metadata-only retrieval achieves […], indicating that concise documentation is sufficient to surface relevant candidates. A listwise LLM reranker pilot applied to the top-10 further raises […] from 0.371 to 0.409, a relative gain of roughly 10% with […], suggesting that the remaining top-10 to top-1 gap is representational and can be reduced by reranking.

The framework also yields large context savings. Serializing all 2,209 skills would consume approximately 884k tokens, assuming about 400 tokens per skill. Retrieval reduces this to exposure to 10 skills, about 4,000 tokens, a reduction of approximately 99.5%. SkillWeaver exposes on average 2.9 skills per query, about 1,160 tokens, for an approximately 99.9% reduction. SAD’s hint list for the second pass adds about 1,100 tokens once per query during decomposition rather than at execution time.
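The context-savings figures follow directly from the stated assumption of about 400 tokens per skill, as the short calculation below shows. The constant and function names are illustrative.

```python
TOKENS_PER_SKILL = 400  # approximate per-skill cost assumed in the text
LIBRARY_SIZE = 2209


def exposure(num_skills):
    """Return (tokens exposed, fractional reduction vs. the full library)."""
    full = LIBRARY_SIZE * TOKENS_PER_SKILL
    exposed = num_skills * TOKENS_PER_SKILL
    return exposed, 1 - exposed / full


full_tokens = LIBRARY_SIZE * TOKENS_PER_SKILL
top10_tokens, top10_reduction = exposure(10)
plan_tokens, plan_reduction = exposure(2.9)

print(f"full library: {full_tokens} tokens")
print(f"top-10 retrieval: {top10_tokens} tokens, {top10_reduction:.1%} reduction")
print(f"composed plan: {plan_tokens:.0f} tokens, {plan_reduction:.1%} reduction")
```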

Transfer experiments are used to assess generalization. In a leave-two-categories-out setting where security and code-execution are removed, DA on affected queries ([…]) improves from 0.452 to 0.613, a relative gain of roughly 36%. In an 80/20 skill hold-out setting on affected queries ([…]), DA improves from 0.560 to 0.690, a relative gain of roughly 23%. The paper states that these results indicate SAD’s gains stem from vocabulary and granularity alignment rather than overfitting to specific skills or categories.
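Relative gain here means the improvement divided by the baseline value. The sketch below applies that definition to the rounded DA values quoted for the two transfer settings; because the inputs are rounded, the results are approximate and may differ slightly from figures computed on raw counts.

```python
def relative_gain(before, after):
    """Improvement expressed as a fraction of the baseline value."""
    return (after - before) / before


# Rounded DA values quoted in the text for the two transfer settings.
category_holdout = relative_gain(0.452, 0.613)
skill_holdout = relative_gain(0.560, 0.690)

print(f"leave-two-categories-out: {category_holdout:.1%}")
print(f"80/20 skill hold-out: {skill_holdout:.1%}")
```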

CompSkillBench is reported as a single 300-query benchmark partitioned by difficulty rather than by train/dev/test split, because the evaluation targets zero-shot routing with retrieval (Gao, 16 Jun 2026). The paper does not provide an explicit dataset URL or license for the compiled benchmark in the text. Because the skills are sourced from a public MCP registry and the curation pipeline is specified, the benchmark can be reconstructed from the public registry using the described filters and category mappings.

The retrieval pipeline is also specified in enough detail to support reproduction. The paper describes five steps:

  1. Build the skill corpus using each skill’s name, description, category or categories, and optionally up to 2,000 characters of body or specification.
  2. Map tags to the 24 canonical categories.
  3. Encode sub-task text and skill representations with all-MiniLM-L6-v2 and $\ell_2$-normalize the embeddings.
  4. Index the embeddings with FAISS IndexFlatIP and retrieve the top-$k$ candidates, with $k = 10$ by default.
  5. For SAD, union the per-step top candidates, select the top-ranked hints (hint count […]), and re-decompose with the hint list in the prompt.

Reported index construction time is about 15 seconds, and retrieval is reported at less than 15 ms per batch.

Several limitations are explicit. The queries are template-generated and synthetic, which may introduce patterns, although the paper states that paraphrase and human-style evaluations, together with transfer experiments, suggest robustness. The benchmark’s decomposition metric evaluates step-count agreement only and does not score dependency correctness or compatibility. Full execution success is not the primary evaluation target, and the pilot execution study is limited to 30 queries. The hard subset contains only 50 queries. SAD adds an extra LLM pass, approximately doubling decomposition latency. The formulation also assumes a one-to-one mapping between steps and skills; extending to many-to-many compositions or skill parameterization is identified as future work.

Relative to adjacent tool-use benchmarks, the paper contrasts CompSkillBench with API-Bank, ToolQA, TaskBench, and MetaTool, which focus on tool use with small or fixed inventories and often single-step or flat task structures. It also contrasts the benchmark with CRAFT, which constructs specialized toolsets per query but does not require explicit multi-step decomposition or dependency-aware composition. CompSkillBench is characterized as unique in combining a large real skill pool, explicit compositional tasks requiring multiple skills and ordered execution, and decomposition-aware and category-aware retrieval metrics such as DA and $CatR@k$.

In the paper’s own summary, CompSkillBench provides the first controlled, compositional, large-skill-pool benchmark for analyzing and improving the decompose–retrieve–compose loop in tool-using LLM agents, grounded in thousands of real MCP skills and metrics that isolate decomposition granularity and step-wise retrieval quality.

References (1)
