- The paper introduces SkillWeaver, a framework that decomposes complex queries, retrieves relevant skills, and composes these into executable pipelines.
- It demonstrates that iterative Skill-Aware Decomposition (SAD) raises decomposition accuracy from 51.0% to 67.7% and CatR@1 from 34.2% to 37.0% on a library of 2,209+ skills.
- The approach cuts context token usage for tool selection by 99.9% (from roughly 884K to 1,160 tokens per query), and a reranking pilot suggests a way to address the remaining representation bottleneck.
Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose (2606.18051)
Introduction and Problem Statement
"Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose" rigorously formalizes the compositional skill routing problem in LLM-based agentic architectures. Whereas prior work predominantly addresses single-skill selection in response to a user query, practical scenarios necessitate chains of atomic operations across a skill library with thousands of modular, composable skills—each specified in natural language and annotated with functional metadata.
The central challenge addressed is: Given a complex user query and a large library of structured skills, how can an agent system efficiently and accurately (1) decompose the query, (2) retrieve relevant skills, and (3) compose them into an executable pipeline? The inadequacy of monolithic, one-skill-per-query approaches is demonstrated, and a three-stage pipeline—SkillWeaver—is introduced for the general compositional routing setting.
Model Architecture
Decompose
SkillWeaver employs a task decomposition module implemented as an instruction-tuned LLM (Qwen2.5-7B-Instruct in the primary experiments) that outputs a sequence of atomic sub-tasks mapped one-to-one to skills. The output is a strictly structured, ordered array in which each element is intended to correspond directly to a specialized skill in the library.
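The exact output schema is not given in this summary. As a minimal sketch, assume the decomposer is prompted to answer with a JSON array of strings, one atomic sub-task per element in execution order. The function name `parse_subtasks` and the `max_steps` cap are hypothetical, chosen only for illustration.

```python
import json


def parse_subtasks(raw: str, max_steps: int = 5) -> list[str]:
    """Parse a decomposer response into an ordered list of sub-task strings.

    Assumption (not from the paper): the LLM answers with a JSON array of
    strings, one atomic sub-task per element, in execution order.
    """
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, list) or not data:
        raise ValueError("expected a non-empty JSON array")
    subtasks = []
    for item in data:
        if not isinstance(item, str) or not item.strip():
            raise ValueError("each sub-task must be a non-empty string")
        subtasks.append(item.strip())
    if len(subtasks) > max_steps:
        raise ValueError(f"too many sub-tasks: {len(subtasks)} > {max_steps}")
    return subtasks


example = '["load the sales CSV", "summarize totals by region", "post the summary to chat"]'
print(parse_subtasks(example))
```

Rejecting malformed output early keeps the retrieval stage from receiving free-form text where it expects one query per sub-task.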
Retrieve
Given the decomposed sub-tasks, the framework uses a bi-encoder (all-MiniLM-L6-v2, 384-dimensional embeddings) for semantic retrieval over skill metadata (name, description, truncated body), indexed with FAISS for candidate selection in under 20 ms among 2,209+ skills. Both metadata-only and body-aware encoding variants are evaluated; metadata alone is found to suffice for high recall.
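The retrieval step amounts to nearest-neighbour search by cosine similarity between a sub-task embedding and the skill embeddings. The sketch below uses only the standard library, with toy 3-dimensional vectors standing in for the 384-dimensional encoder outputs and a brute-force scan standing in for the FAISS index. The skill names and vectors are made up for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, skill_vecs, k=10):
    """Return the k (skill_id, cosine_similarity) pairs with highest similarity."""
    q = normalize(query_vec)
    scored = []
    for skill_id, vec in skill_vecs.items():
        v = normalize(vec)
        scored.append((skill_id, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumption): real ones would come from the bi-encoder.
skills = {
    "csv_reader": [0.9, 0.1, 0.0],
    "chart_maker": [0.1, 0.9, 0.1],
    "slack_poster": [0.0, 0.2, 0.9],
}
hits = top_k([0.8, 0.2, 0.1], skills, k=2)
print([skill_id for skill_id, _ in hits])  # ['csv_reader', 'chart_maker']
```

With a few thousand skills a linear scan like this is already cheap; an index mainly matters as the library grows.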
Compose
Given top-k retrieved candidates for each sub-task, a dependency-aware DAG planner finalizes skill assignment, factoring both retrieval score and pairwise inter-skill compatibility via a convex combination of semantic similarity and compatibility heuristics (I/O type, taxonomy, and keyword overlap). Output is an executable skill chain (DAG), ready for orchestration by the agent runtime.
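To make the scoring idea concrete, here is a minimal sketch for the special case of a linear chain (a DAG with one path). It picks one candidate per step to maximize a convex combination of retrieval score and pairwise compatibility, using dynamic programming. The weight `alpha`, the function `plan_chain`, and the toy I/O-type compatibility rule are assumptions for illustration, not the paper's planner.

```python
def plan_chain(candidates, compat, alpha=0.7):
    """Pick one skill per step to maximize a blended chain score.

    candidates: one list per sub-task of (skill_id, retrieval_score) pairs.
    compat:     function (prev_skill_id, next_skill_id) -> score in [0, 1].
    alpha:      weight on retrieval score; (1 - alpha) weights compatibility.
    """
    if not candidates:
        raise ValueError("need at least one step")
    # Each layer maps skill_id -> (best total score ending here, back-pointer).
    layers = [{skill: (alpha * score, None) for skill, score in candidates[0]}]
    for step in candidates[1:]:
        layer = {}
        for skill, score in step:
            prev_skill, prev_total = max(
                ((p, total + (1 - alpha) * compat(p, skill))
                 for p, (total, _) in layers[-1].items()),
                key=lambda pair: pair[1],
            )
            layer[skill] = (prev_total + alpha * score, prev_skill)
        layers.append(layer)
    # Trace back from the best final skill to recover the chain.
    skill = max(layers[-1], key=lambda s: layers[-1][s][0])
    chain = [skill]
    for layer in reversed(layers[1:]):
        skill = layer[skill][1]
        chain.append(skill)
    return chain[::-1]


# Hypothetical (input_type, output_type) annotations per skill.
io_types = {
    "csv_reader": ("path", "table"),
    "pdf_reader": ("path", "text"),
    "table_summarizer": ("table", "text"),
    "text_summarizer": ("text", "text"),
}


def io_compat(prev_skill, next_skill):
    """Toy heuristic: compatible when the previous output type matches the next input type."""
    return 1.0 if io_types[prev_skill][1] == io_types[next_skill][0] else 0.0


candidates = [
    [("csv_reader", 0.90), ("pdf_reader", 0.85)],
    [("table_summarizer", 0.80), ("text_summarizer", 0.82)],
]
print(plan_chain(candidates, io_compat))  # ['csv_reader', 'table_summarizer']
```

Picking the top retrieval hit at each step independently would pair `csv_reader` with `text_summarizer`, whose types do not match in this toy setup; the compatibility term is what steers the planner to a consistent chain.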
Skill-Aware Decomposition (SAD)
A core contribution is the Iterative Skill-Aware Decomposition (SAD) procedure: SAD establishes an LLM-driven, retrieval-augmented feedback loop in which retrieved skill "hints" from the first decomposition are explicitly injected into a second-pass decomposition prompt. This process iterates until the extracted sub-task sequence aligns at the vocabulary granularity required by the skill pool. The SAD loop often converges in a single iteration.
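The feedback loop can be sketched as follows. The decomposer and retriever are passed in as plain callables so the control flow is visible without an LLM. The stopping rule used here (stop when a pass no longer changes the sub-task list, or after `max_iters` passes) is an assumption, since this summary does not state the exact convergence test. All names are hypothetical.

```python
def skill_aware_decompose(query, decompose, retrieve, max_iters=2, hints_per_step=3):
    """Refine a decomposition by feeding retrieved skill names back as hints.

    decompose(query, hints) -> list of sub-task strings (hints may be None).
    retrieve(subtask, k)    -> list of up to k skill names for that sub-task.
    """
    subtasks = decompose(query, None)  # first pass: no skill vocabulary yet
    for _ in range(max_iters):
        hints = []
        for sub in subtasks:
            for name in retrieve(sub, hints_per_step):
                if name not in hints:  # de-duplicate while preserving order
                    hints.append(name)
        revised = decompose(query, hints)  # second pass sees skill hints
        if revised == subtasks:  # assumed convergence test
            break
        subtasks = revised
    return subtasks


# Toy stand-ins for the LLM decomposer and the retriever (illustration only).
def toy_decompose(query, hints):
    if not hints:
        return ["analyze sales data and share it"]  # too coarse
    return ["load csv", "summarize table", "post message"]


def toy_retrieve(subtask, k):
    return ["csv_reader", "table_summarizer", "slack_poster"][:k]


print(skill_aware_decompose("Analyze Q3 sales and share", toy_decompose, toy_retrieve))
```

In this toy run the first pass yields one coarse step, the hinted pass yields three steps, and a further pass changes nothing, so the loop stops.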
Experimental Protocol and Benchmark
To standardize evaluation, the paper introduces CompSkillBench, a compositional skill routing benchmark comprising 300 multi-step, multi-category queries over 2,209 real public MCP server skills (24 categories). Each query has 2–5 ground-truth steps with dense annotation at the category and skill level, and is constructed to avoid lexical leakage between queries and skill names or descriptions. Evaluation covers not only step- and chain-level recall (skill-level recall and CatR@k) but also Decomposition Accuracy (DA and DA±1).
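This summary does not spell out the metric definitions. One plausible reading, assumed here purely for illustration, is that DA is the fraction of queries whose predicted step count equals the ground-truth count, DA±1 allows the count to be off by one, and CatR@k is the fraction of steps whose ground-truth category appears among the top-k retrieved categories. The function names are hypothetical.

```python
def decomposition_accuracy(pred_counts, gold_counts, tolerance=0):
    """Fraction of queries whose predicted step count is within `tolerance` of gold.

    Assumed reading of DA (tolerance=0) and DA±1 (tolerance=1).
    """
    hits = sum(abs(p - g) <= tolerance for p, g in zip(pred_counts, gold_counts))
    return hits / len(gold_counts)


def recall_at_k(ranked_labels_per_step, gold_labels, k):
    """Fraction of steps whose gold label appears in the top-k ranked labels."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_labels_per_step, gold_labels))
    return hits / len(gold_labels)


# Toy data: predicted vs. ground-truth step counts for four queries.
pred = [3, 2, 5, 4]
gold = [3, 3, 5, 2]
print(decomposition_accuracy(pred, gold))               # 0.5
print(decomposition_accuracy(pred, gold, tolerance=1))  # 0.75

# Toy data: ranked categories per step vs. the gold category of each step.
ranked = [["files", "db"], ["chat", "email"], ["db", "files"]]
gold_cats = ["files", "email", "web"]
print(recall_at_k(ranked, gold_cats, k=2))  # 2 of 3 steps hit
```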
Empirical Findings
Bottleneck Analysis
Systematic analysis demonstrates that task decomposition is the primary bottleneck in compositional skill routing. Baseline LLM decomposition achieves only 51.0% DA and 34.2% CatR@1, far from the retrieval ceiling (CatR@10 ≈ 70%). Conditioning on correct decomposition (DA=1) raises CatR@1 to 41.2%. Thus, improving decomposition granularity, rather than retrieval score or encoder scale, is the dominant lever.
SAD Effectiveness
Inserting a single SAD iteration improves decomposition accuracy from 51.0% to 67.7% (+32.7% relative; Wilcoxon p < 10⁻⁶) and CatR@1 from 34.2% to 37.0% (+8.2% relative). The effect is robust to paraphrase, holds across all difficulty levels (with the highest gains on the hardest queries), and generalizes to unseen categories and held-out skills (+35.6% DA on held-out categories). SAD acts primarily as a granularity corrector: per-step retrieval improvement is contingent on the decomposition reaching the correct sub-task granularity.
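The headline gains are relative, not absolute, which is easy to misread. A quick check of the arithmetic from the reported before and after values (the CatR@1 baseline of 34.2% comes from the bottleneck analysis above):

```python
def relative_gain(before, after):
    """Relative improvement in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


da_abs = 67.7 - 51.0                   # absolute gain in percentage points
da_rel = relative_gain(51.0, 67.7)     # relative gain in DA
catr_rel = relative_gain(34.2, 37.0)   # relative gain in CatR@1

print(f"DA: +{da_abs:.1f} points, +{da_rel:.1f}% relative")  # +16.7 points, +32.7% relative
print(f"CatR@1: +{catr_rel:.1f}% relative")                  # +8.2% relative
```

So the 32.7% figure corresponds to a 16.7-point absolute increase in DA, and the 8.2% figure to a 2.8-point increase in CatR@1.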
Reranking and Encoder Independence
A pilot with a listwise reranker (Qwen2.5-7B) further lifts CatR@1 from 37.1% to 40.9% (+10.3% relative), indicating that the remaining gap between top-10 and top-1 retrieval is a representation and ranking bottleneck rather than a decomposition one. Replacing the encoder (BGE-base-en-v1.5) also produces orthogonal improvements. SkillWeaver’s performance can therefore be improved further by integrating larger sentence encoders or fine-tuned cross-encoders.
Efficiency
SkillWeaver reduces the average context window token usage for tool selection by 99.9% (from ∼884K to ∼1,160 tokens per query), enabling scalable execution on resource-constrained platforms.
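The reduction figure follows directly from the two token counts. The snippet below redoes the arithmetic, treating the approximate values 884K and 1,160 as exact for the purpose of the check.

```python
baseline_tokens = 884_000  # approximate tokens per query without routing
routed_tokens = 1_160      # approximate tokens per query with SkillWeaver

reduction_pct = (1 - routed_tokens / baseline_tokens) * 100
shrink_factor = baseline_tokens / routed_tokens

print(f"reduction: {reduction_pct:.2f}%")    # 99.87%, i.e. about 99.9%
print(f"context is ~{shrink_factor:.0f}x smaller")  # ~762x
```

Expressed as a ratio, the routed context is roughly 760 times smaller than the baseline, which is the practical meaning of the 99.9% figure.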
Execution Pilot
A mock execution pilot (10 categories, 30 queries) validates the end-to-end viability of SAD-generated skill DAGs: 76.7% chain completion, 86.9% step execution rate.
Implications, Limitations, and Future Outlook
This work demonstrates that granular, retrieval-augmented, and composition-aware skill routing is tractable at scale (2,209+ skills) given LLM-driven decomposers augmented with skill-vocabulary feedback. The iterative SAD procedure is shown empirically to narrow the structural gap between natural user language and the specialized operational vocabulary of skill repositories, a limitation of previous API and tool selection approaches. Additionally, the finding that reranking addresses part of the residual representation gap points to clear directions for future work on joint modeling and supervised ranking.
Limitations include reliance on template-generated queries for benchmark construction, restriction to a one-to-one mapping between sub-tasks and skills, and the lack of explicit compatibility or execution error signals in the primary evaluation. The compose stage is only partially evaluated, and further work is required for real-world, end-to-end, fault-tolerant execution pipelines.
Longer term, rigorous compositional skill routing presents a foundational layer for "open" agent frameworks that aim to orchestrate workflows over arbitrary modular tools, especially as skill repositories scale toward tens of thousands of heterogeneous, community-authored modules. Agent generalization across novel combinations, robust negative sampling, and grounding to on-the-fly skill ecosystems will hinge on the continued development of routing primitives as advanced as SkillWeaver+SAD.
Conclusion
The SkillWeaver framework establishes a principled pipeline and evaluation methodology for compositional skill routing in LLM agents at library scale. By introducing the SAD procedure, it targets the main architectural bottleneck, the mismatch in decomposition granularity between query and skill pool, and reports significant improvements in modularity, retrieval recall, and execution tractability. The analyses indicate that, at the present state of the art, further advances will depend on integrated reranking, multi-hop compatibility modeling, and large-scale human-authored benchmarks for open-ended agent reasoning over modular skills.