CompSkillBench: Benchmark for Compositional Skill Routing
- CompSkillBench is a benchmark for compositional skill routing in LLM agents, requiring query decomposition, skill retrieval, and plan composition over a large pool of real MCP skills.
- It employs detailed metrics such as Decomposition Accuracy, Skill Recall@k, and Category Recall to diagnose granularity errors and retrieval performance.
- The benchmark facilitates realistic evaluation of multi-step, dependency-aware workflows, advancing tool-using LLM agent research through controlled, compositional task assessment.
CompSkillBench is a benchmark for compositional skill routing in tool-using LLM agents, introduced with the SkillWeaver framework in "Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose" (Gao, 16 Jun 2026). It formalizes the setting in which an agent receives a complex user query and must operate over a large library of real, reusable skills by decomposing the query into atomic sub-tasks, retrieving an appropriate skill for each sub-task, and composing an executable plan. The benchmark contains 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, all sourced from the public MCP ecosystem, and is designed to isolate decomposition granularity, step-wise retrieval quality, and multi-step workflow structure in a large-skill-pool regime.
1. Problem formulation and motivation
CompSkillBench was created to evaluate LLM agents on compositional skill routing. Given a complex query and a large library of real skills, the agent must perform three operations: decompose the task into atomic sub-tasks, retrieve an appropriate skill for each sub-task, and compose an executable plan (Gao, 16 Jun 2026). This formulation departs from prior tool-use evaluations that largely assume single-tool selection or small, fixed tool inventories. The benchmark therefore targets a setting closer to contemporary agent ecosystems, where thousands of community-maintained skills coexist and realistic tasks require multi-skill compositions rather than flat tool invocation.
The benchmark addresses two gaps. The first is scale and realism: it uses thousands of real MCP server skills with authentic metadata and taxonomy rather than synthetic tool descriptions or tiny curated toolsets. The second is compositionality and structure: it evaluates step-wise retrieval and decomposition granularity for multi-step, dependency-aware workflows. A central design premise is that retrieval errors in such systems are often downstream consequences of incorrect decomposition granularity. The benchmark is therefore not limited to end-to-end success rates; it includes dedicated metrics for diagnosing the cascading bottlenecks in decompose–retrieve–compose pipelines.
This emphasis changes what is being measured. Instead of asking only whether an LLM can invoke a tool, CompSkillBench asks whether it can align a natural-language query with the latent structure of a large skill library. The benchmark thereby treats task decomposition as a first-class evaluand rather than an unobserved intermediate artifact.
2. Dataset construction and taxonomy
The skill corpus in CompSkillBench consists of 2,209 unique skills sourced from the public MCP ecosystem, specifically the awesome-mcp-servers registry (Gao, 16 Jun 2026). The curation pipeline is explicitly specified:
- Extraction: 2,228 server entries were parsed, including name, description, category, and repository URL.
- Quality filtering: entries with descriptions shorter than 15 characters or consisting mostly of badges were removed, reducing the corpus to 2,213.
- Deduplication: identical normalized names were merged, yielding 2,209 unique skills.
- Categorization: 49 original tags were mapped to 24 canonical categories via a curated mapping.
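The four curation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the field names (`name`, `description`, `tag`), the name-normalization rule, the badge heuristic, the keep-first deduplication policy, and the three-entry `TAG_TO_CATEGORY` mapping are all assumptions made for the example. Only the 15-character threshold comes from the text.

```python
import re

# Hypothetical excerpt of the tag-to-category mapping (the real one covers 49 tags).
TAG_TO_CATEGORY = {
    "databases": "databases",
    "db": "databases",
    "search": "search-extraction",
}

MIN_DESCRIPTION_CHARS = 15  # threshold stated in the text


def normalize_name(name: str) -> str:
    """Lowercase and collapse non-alphanumeric runs (an assumed normalization)."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def is_mostly_badges(description: str) -> bool:
    """Assumed heuristic: markdown image badges make up over half of the text."""
    badges = re.findall(r"!\[[^\]]*\]\([^)]*\)", description)
    return bool(description) and sum(len(b) for b in badges) > len(description) / 2


def curate(entries: list[dict]) -> list[dict]:
    """Apply quality filtering, deduplication, and categorization in order."""
    seen: set[str] = set()
    skills = []
    for entry in entries:
        desc = entry["description"].strip()
        if len(desc) < MIN_DESCRIPTION_CHARS or is_mostly_badges(desc):
            continue  # quality filter
        key = normalize_name(entry["name"])
        if key in seen:
            continue  # deduplication: keep the first entry per normalized name
        seen.add(key)
        # Unmapped tags yield None here; how the paper handles them is not stated.
        skills.append({**entry, "id": key, "category": TAG_TO_CATEGORY.get(entry["tag"])})
    return skills
```

Run against the real registry dump, a pipeline of this shape would be expected to reproduce the reported 2,228 → 2,213 → 2,209 counts only if its heuristics match the paper's, which the text does not specify in full.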
Only categories with at least 5 skills are used in queries, spanning 23 categories in practice. The benchmark taxonomy comprises the following 24 functional categories: developer-tools, finance, integrations, knowledge-management, search-extraction, security, communication, databases, cloud-infrastructure, code-execution, productivity, gaming-entertainment, data-processing, location-services, browser-automation, marketing-analytics, monitoring-observability, ai-ml, multimedia, science-research, file-management, e-commerce, legal-compliance, and data-visualization.
Top categories by skill count include developer-tools (357), finance (270), integrations (229), knowledge-management (180), search-extraction (140), security (122), communication (109), databases (104), cloud-infrastructure (87), and code-execution (69). Representative skills listed in the benchmark description include firecrawl and serper-mcp for search-extraction, slack-mcp and email-server for communication, postgres-mcp and redis-server for databases, and aws-mcp and terraform-server for cloud-infrastructure.
The taxonomy is not merely organizational metadata. It is also the basis for the benchmark’s category-aware retrieval metrics, particularly Category Recall@k, which acknowledges that multiple skills within the same category may be functionally interchangeable for a given step.
3. Query construction and compositional structure
CompSkillBench contains 300 synthetic queries designed to require multi-skill compositions across categories (Gao, 16 Jun 2026). The difficulty partition is fixed by construction.
| Difficulty | Queries | Typical complexity |
|---|---|---|
| Easy | 150 | 2 skills from 2 categories |
| Medium | 100 | 3 skills from 3 categories |
| Hard | 50 | 4–5 skills from 4–5 categories |
The queries are template-driven, built by combining verb phrases across categories, such as “query the database” and “send a notification.” Ground-truth step descriptions are intentionally written so as not to copy skill names, which forces semantic retrieval rather than lexical matching. Each query is annotated with ground-truth sub-task descriptions, skill IDs, required categories, and an execution order.
The dominant execution pattern is a sequential chain, but the framework associated with the benchmark can also detect parallelism and produce a dependency-aware DAG when dependencies allow. An illustrative example is the query: “Download the dataset, transform it, and create visual reports.” Its ground-truth decomposition is a three-step chain: download dataset, transform data, create reports. The candidate categories for these steps include search-extraction or file-management for acquisition, data-processing for transformation, and data-visualization or marketing-analytics for reporting. The associated dependency structure is linear: the download feeds transformation, and transformation feeds visualization.
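To make the annotation format and the dependency structure concrete, the sketch below encodes the example query as a small data structure and derives its execution waves with the standard-library `graphlib` module. The class names, field names, and skill IDs are hypothetical; the benchmark's actual schema is not given in the text.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Step:
    description: str  # ground-truth sub-task text, written without skill names
    skill_id: str     # hypothetical placeholder IDs are used below
    category: str
    depends_on: list[int] = field(default_factory=list)  # indices of prerequisite steps


@dataclass
class Query:
    text: str
    steps: list[Step]

    def execution_waves(self) -> list[list[int]]:
        """Group step indices into waves; steps in one wave can run in parallel."""
        graph = {i: set(s.depends_on) for i, s in enumerate(self.steps)}
        sorter = TopologicalSorter(graph)
        sorter.prepare()
        waves = []
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            waves.append(ready)
            sorter.done(*ready)
        return waves


example = Query(
    text="Download the dataset, transform it, and create visual reports.",
    steps=[
        Step("download dataset", "example-downloader", "file-management"),
        Step("transform data", "example-transformer", "data-processing", depends_on=[0]),
        Step("create reports", "example-charts", "data-visualization", depends_on=[1]),
    ],
)
```

For this linear chain every wave contains a single step. If two steps shared one prerequisite and did not depend on each other, they would land in the same wave, which is the kind of parallelism the text says the framework can detect.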
Although the queries are synthetic, they are designed to operationalize a specific evaluation target: semantic alignment between decomposed natural-language steps and real skills in a large, noisy ecosystem. The synthetic construction is therefore a control mechanism for benchmarking, not an attempt to imitate raw user logs.
4. Evaluation protocol and metrics
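CompSkillBench evaluates systems at three granularities: step-level retrieval, chain-level composition, and decomposition granularity (Gao, 16 Jun 2026). Let $N$ be the number of queries. Query $i$ has $n_i$ ground-truth steps, and the total number of steps is

$$S = \sum_{i=1}^{N} n_i.$$

The step-level retrieval metrics are defined as follows. Skill Recall@k measures whether the exact ground-truth skill $s^*_{ij}$ is present in the top-$k$ retrieved candidates $\mathcal{R}_k(i,j)$ for step $j$ of query $i$:

$$\mathrm{R@}k = \frac{1}{S} \sum_{i=1}^{N} \sum_{j=1}^{n_i} \mathbb{1}\left[ s^*_{ij} \in \mathcal{R}_k(i,j) \right].$$

Category Recall@k is a relaxed metric that checks whether any candidate from the correct category $c^*_{ij}$ appears in the top-$k$:

$$\mathrm{CatR@}k = \frac{1}{S} \sum_{i=1}^{N} \sum_{j=1}^{n_i} \mathbb{1}\left[ \exists\, s \in \mathcal{R}_k(i,j) : \mathrm{cat}(s) = c^*_{ij} \right].$$

This category-based relaxation is motivated by the fact that many skills in a category are functionally interchangeable.

The chain-level metrics are Chain Exact Match, the fraction of queries for which every step's selected skill $\hat{s}_{ij}$ exactly matches the ground truth,

$$\mathrm{ChainEM} = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{n_i} \mathbb{1}\left[ \hat{s}_{ij} = s^*_{ij} \right],$$

and Chain Category Match, the average fraction of steps per query whose selected skill lies in the ground-truth category:

$$\mathrm{ChainCat} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbb{1}\left[ \mathrm{cat}(\hat{s}_{ij}) = c^*_{ij} \right].$$

For decomposition, the benchmark defines Decomposition Accuracy (DA) as strict step-count accuracy, where $\hat{n}_i$ is the predicted number of steps for query $i$:

$$\mathrm{DA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ \hat{n}_i = n_i \right].$$

It also defines a relaxed variant that tolerates an off-by-one step count,

$$\mathrm{DA}_{\pm 1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ \left| \hat{n}_i - n_i \right| \le 1 \right].$$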
A crucial clarification is that, in CompSkillBench, DA denotes exact step-count agreement only. It does not score dependency edges, and ordering and dependency evaluation are not part of DA. This is one of the benchmark’s most important interpretive constraints.
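The metric definitions in this section translate directly into code. The sketch below is one possible reading and rests on three assumptions: predictions are aligned to ground-truth steps by position, skill categories come from a lookup table `cat`, and the relaxed DA variant uses a tolerance of one step. The function and argument names are illustrative, not taken from the paper.

```python
def _pairs(gold, pred):
    """Pair each ground-truth step with the prediction at the same position."""
    return [(g, p) for gq, pq in zip(gold, pred) for g, p in zip(gq, pq)]


def skill_recall_at_k(gold, ranked, k):
    """Fraction of steps whose exact gold skill is among the top-k candidates."""
    pairs = _pairs(gold, ranked)
    return sum(g in r[:k] for g, r in pairs) / len(pairs)


def category_recall_at_k(gold, ranked, k, cat):
    """Fraction of steps with at least one top-k candidate in the gold category."""
    pairs = _pairs(gold, ranked)
    return sum(any(cat[s] == cat[g] for s in r[:k]) for g, r in pairs) / len(pairs)


def chain_exact_match(gold, selected):
    """Fraction of queries whose selected chain equals the gold chain."""
    return sum(g == s for g, s in zip(gold, selected)) / len(gold)


def chain_category_match(gold, selected, cat):
    """Mean per-query fraction of steps whose selected skill is in the gold category."""
    per_query = [
        sum(cat[g] == cat[s] for g, s in zip(gq, sq)) / len(gq)
        for gq, sq in zip(gold, selected)
    ]
    return sum(per_query) / len(per_query)


def decomposition_accuracy(gold, predicted_counts, tolerance=0):
    """Step-count agreement only; tolerance=1 gives the assumed relaxed variant."""
    hits = sum(abs(len(gq) - n) <= tolerance for gq, n in zip(gold, predicted_counts))
    return hits / len(gold)
```

Note that `decomposition_accuracy` never looks at step content, order, or dependency edges, which mirrors the interpretive constraint stated above: a decomposition with the right number of wrong steps still scores as correct under DA.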
The paper also reports a pilot execution study using mock executors. In a 30-query pilot with an average of 2.80 predicted steps, Step Execution Success (SES) is 86.9% (73/84 steps) and Chain Completion Rate (CCR) is 76.7% (23/30 chains). These execution results are supplementary: the main evaluation target remains decomposition and retrieval.
5. SkillWeaver and empirical findings on CompSkillBench
The benchmark is paired with SkillWeaver, a decompose–retrieve–compose framework used to probe performance on CompSkillBench (Gao, 16 Jun 2026). In the decompose stage, an instruction-tuned LLM, Qwen2.5-7B-Instruct, produces an ordered list of atomic sub-tasks. The paper's main intervention is Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop in which the names and descriptions of the top retrieved skills are fed back as hints to refine decomposition granularity; the default is a single iteration.
In the retrieve stage, a bi-encoder retriever based on all-MiniLM-L6-v2 encodes sub-task text and skill metadata into 384-dimensional embeddings. The vectors are $L_2$-normalized and indexed with FAISS IndexFlatIP for exact inner-product top-$k$ search, with $k = 10$ by default. Two skill representations are considered: metadata-only, using name and description, and body-aware, which can optionally include up to 2,000 characters of the specification body.
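The retrieval step is exact top-k search by inner product over unit-length vectors, which is equivalent to ranking by cosine similarity. The sketch below shows that computation in pure Python as a stand-in: it is a brute-force scan rather than a FAISS index, and the three-dimensional vectors are hand-written toy values rather than 384-dimensional MiniLM embeddings. The skill names are representative ones from the corpus description; their vectors are invented for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, skill_vecs, k=10):
    """Exact inner-product search over L2-normalized vectors.

    skill_vecs maps skill ID -> embedding. Returns (skill_id, score) pairs,
    highest score first, with ties broken by skill ID for determinism.
    """
    q = l2_normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, l2_normalize(v))), skill_id)
        for skill_id, v in skill_vecs.items()
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(skill_id, score) for score, skill_id in scored[:k]]


# Toy vectors, invented for illustration only.
toy_skills = {
    "postgres-mcp": [1.0, 0.0, 0.0],
    "redis-server": [0.8, 0.1, 0.1],
    "slack-mcp": [0.0, 1.0, 0.0],
}
hits = top_k([0.9, 0.1, 0.0], toy_skills, k=2)
```

With a real encoder and a 2,209-skill corpus the same logic would normally be delegated to a vector index; the brute-force version is shown only because it makes the normalization and ranking explicit.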
In the compose stage, SkillWeaver selects a skill per step by combining similarity with inter-step compatibility and detects dependencies to form a DAG. The empirical emphasis, however, is on the earlier stages: the paper identifies decomposition quality as the primary bottleneck.
Under the default setting (Qwen2.5-7B decomposition, MiniLM retrieval, $k = 10$, and the full 2,209-skill library), the paper compares vanilla decomposition without hints against one iteration of SAD across its decomposition and retrieval metrics. The relative gain in DA is reported together with a Wilcoxon signed-rank test, with further detail in the appendix, and with a bootstrap 95% confidence interval.
The most important analytical result is the DA-conditioned analysis. Restricting attention to queries whose predicted step count matches the ground truth, one retrieval metric rises from 34.2% to 41.2%, and a second rises to 81.6%. The paper interprets this as evidence that correct granularity is a prerequisite for effective retrieval. The difficulty breakdown follows the same pattern: SAD improves DA for Easy queries from 44.7% to 63.3%, for Medium from 66.0% to 78.0%, and for Hard from 40.0% to 60.0%, while CatR@1 gains remain comparatively modest.
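As a consistency check, the per-difficulty DA figures can be pooled using the query counts from the difficulty table (150 Easy, 100 Medium, 50 Hard). The result below is derived arithmetic from the rounded percentages quoted above, not a number quoted from the paper, so small rounding differences against the paper's own overall figure are to be expected.

```python
# Query counts per difficulty tier, from the benchmark's difficulty table.
COUNTS = {"Easy": 150, "Medium": 100, "Hard": 50}

# Per-tier DA as quoted in the text (rounded percentages).
DA_VANILLA = {"Easy": 0.447, "Medium": 0.660, "Hard": 0.400}
DA_SAD = {"Easy": 0.633, "Medium": 0.780, "Hard": 0.600}


def pooled_accuracy(per_tier):
    """Query-weighted average of per-tier accuracies."""
    total = sum(COUNTS.values())
    return sum(COUNTS[tier] * per_tier[tier] for tier in COUNTS) / total


vanilla = pooled_accuracy(DA_VANILLA)      # roughly 0.51
sad = pooled_accuracy(DA_SAD)              # roughly 0.68
relative_gain = (sad - vanilla) / vanilla  # roughly one third
```

The pooled values imply that about half of all vanilla decompositions have the wrong step count, which is what makes decomposition granularity a plausible primary bottleneck.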
Further observations reinforce the same diagnosis. Metadata-only retrieval, which uses only the name and description, is reported to be sufficient to surface relevant candidates, indicating that concise documentation is adequate at the candidate-generation stage. A listwise LLM reranker pilot applied to the top-10 candidates further raises a top-1 retrieval metric from 0.371 to 0.409, a relative gain of roughly 10%, suggesting that the remaining top-10 to top-1 gap is representational and can be reduced by reranking.
The framework also yields large context savings. Serializing all 2,209 skills would consume approximately 884k tokens, assuming about 400 tokens per skill. Retrieval alone limits exposure to the top 10 skills, about 4,000 tokens, a reduction of approximately 99.5%. SkillWeaver exposes on average 2.9 skills per query, about 1,160 tokens, for an approximately 99.9% reduction. SAD's hint list for the second pass adds about 1,100 tokens once per query, and this cost is paid during decomposition rather than at execution time.
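The context-savings figures follow from simple arithmetic under the stated assumption of about 400 tokens per skill. The snippet below reproduces them; the constant and function names are illustrative.

```python
TOKENS_PER_SKILL = 400  # approximate per-skill cost assumed in the text
LIBRARY_SIZE = 2209     # skills in the full corpus


def context_tokens(n_skills):
    """Approximate prompt tokens needed to expose n_skills skill descriptions."""
    return n_skills * TOKENS_PER_SKILL


def reduction(n_skills):
    """Fractional saving relative to serializing the whole library."""
    return 1 - context_tokens(n_skills) / context_tokens(LIBRARY_SIZE)


full_library = context_tokens(LIBRARY_SIZE)  # 883,600 tokens, i.e. about 884k
top10_saving = reduction(10)                 # about 0.995
selected_saving = reduction(2.9)             # about 0.999
```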
Transfer experiments are used to assess generalization. In a leave-two-categories-out setting where security and code-execution are removed, DA on affected queries improves from 0.452 to 0.613, a relative gain of about 36%. In an 80/20 skill hold-out setting, DA on affected queries improves from 0.560 to 0.690, a relative gain of about 23%. The paper states that these results indicate SAD's gains stem from vocabulary and granularity alignment rather than overfitting to specific skills or categories.
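The relative gains implied by the before-and-after values in this section can be recomputed directly. The helper below is a small illustration, and the percentages it yields are derived from the quoted endpoints rather than copied from the paper.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before


# Endpoints quoted in the text.
leave_two_categories_out = relative_gain(0.452, 0.613)  # about 0.356
skill_hold_out = relative_gain(0.560, 0.690)            # about 0.232
reranker_pilot = relative_gain(0.371, 0.409)            # about 0.102
```

Both transfer settings show a larger relative DA gain than the reranker pilot shows for retrieval, which is in line with the paper's emphasis on decomposition as the stage with the most headroom.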
6. Reproducibility, limitations, and position among related benchmarks
CompSkillBench is reported as a single 300-query benchmark partitioned by difficulty rather than by train/dev/test split, because the evaluation targets zero-shot routing with retrieval (Gao, 16 Jun 2026). The paper does not provide an explicit dataset URL or license for the compiled benchmark in the text. Because the skills are sourced from a public MCP registry and the curation pipeline is specified, the benchmark can be reconstructed from the public registry using the described filters and category mappings.
The retrieval pipeline is also specified in enough detail to support reproduction. The paper describes the procedure as a sequence of steps: build the skill corpus using each skill's name, description, category or categories, and optionally up to 2,000 characters of body or specification; map tags to the 24 canonical categories; encode sub-task text and skill representations with all-MiniLM-L6-v2; $L_2$-normalize the embeddings; index them with FAISS IndexFlatIP; retrieve the top-$k$ candidates, with $k = 10$ by default; and, for SAD, union the per-step top candidates, select the top-ranked hints, and re-decompose with the hint list in the prompt. Reported index construction time is about 15 seconds, and retrieval is reported at less than 15 ms per batch.
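The SAD feedback loop described here can be outlined as follows. This is a hedged sketch rather than the paper's implementation: `decompose` and `retrieve` are hypothetical callables standing in for the LLM and the retriever, hints are passed as skill IDs rather than names with descriptions, the hints are ranked by best score across steps (an assumption), and the hint budget `n_hints` is left to the caller because its default value is not recoverable from the text.

```python
from typing import Callable

# (sub-task text, k) -> ranked list of (skill_id, similarity score)
Retriever = Callable[[str, int], list[tuple[str, float]]]
# (query text, hint list) -> ordered list of sub-task descriptions
Decomposer = Callable[[str, list[str]], list[str]]


def skill_aware_decompose(
    query: str,
    decompose: Decomposer,
    retrieve: Retriever,
    *,
    n_hints: int,
    k: int = 10,
    iterations: int = 1,
) -> list[str]:
    """Decompose once without hints, then refine using retrieved skills as hints."""
    steps = decompose(query, [])  # vanilla first pass
    for _ in range(iterations):
        best: dict[str, float] = {}
        for step in steps:
            # Union the per-step candidates, keeping each skill's best score.
            for skill_id, score in retrieve(step, k):
                best[skill_id] = max(score, best.get(skill_id, score))
        ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
        hints = [skill_id for skill_id, _ in ranked[:n_hints]]
        steps = decompose(query, hints)  # second pass sees the skill vocabulary
    return steps
```

The design point the loop illustrates is that the decomposer sees the library's vocabulary before committing to a step count, so the extra cost is one additional LLM call per iteration.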
Several limitations are explicit. The queries are template-generated and synthetic, which may introduce template-specific regularities, although the paper states that paraphrase and human-style evaluations, together with transfer experiments, suggest robustness. The benchmark's decomposition metric evaluates step-count agreement only and does not score dependency correctness or compatibility. Full execution success is not the primary evaluation target, and the pilot execution study is limited to 30 queries. The hard subset contains only 50 queries. SAD adds an extra LLM pass, approximately doubling decomposition latency. The formulation also assumes a one-to-one mapping between steps and skills; extending to many-to-many compositions or skill parameterization is identified as future work.
Relative to adjacent tool-use benchmarks, the paper contrasts CompSkillBench with API-Bank, ToolQA, TaskBench, and MetaTool, which focus on tool use with small or fixed inventories and often single-step or flat task structures. It also contrasts the benchmark with CRAFT, which constructs specialized toolsets per query but does not require explicit multi-step decomposition or dependency-aware composition. CompSkillBench is characterized as unique in combining a large real skill pool, explicit compositional tasks requiring multiple skills and ordered execution, and decomposition-aware and category-aware retrieval metrics such as DA and Category Recall@k.
In the paper’s own summary, CompSkillBench provides the first controlled, compositional, large-skill-pool benchmark for analyzing and improving the decompose–retrieve–compose loop in tool-using LLM agents, grounded in thousands of real MCP skills and metrics that isolate decomposition granularity and step-wise retrieval quality.