Compositional Skill Routing

Updated 4 July 2026

Compositional skill routing is a modular AI framework that represents reusable skills as explicit units, embeddings, or workflows to execute complex behaviors.
It employs techniques such as differentiable composition, query-conditional routing, and semantic control-flow to optimize transfer, efficiency, and interpretability.
This approach improves performance and safety by enabling granular task decomposition, flexible orchestration, and cost-aware decision-making in AI systems.

Compositional skill routing denotes a family of methods that represent reusable skills as explicit modules, embeddings, workflows, or library entries and then route states, queries, or action sequences through those units so that larger behaviors emerge from their composition rather than from a single monolithic policy. In the literature represented here, the term spans differentiable policy composition in reinforcement learning, query-conditional routing over large skill libraries for LLM agents, mode- and cost-aware orchestration in compound AI systems, semantically structured expert routing in robotic manipulation, and governance of skill updates in compositional robot policies (Sahni et al., 2017, Wang et al., 23 Feb 2026, Deng et al., 22 May 2026, Zheng et al., 23 Mar 2026, Qin et al., 29 Apr 2026).

1. Conceptual scope and problem formulations

A common misconception is that skill routing reduces either to document retrieval or to selecting one option at a time. The literature rejects both simplifications. "Learning to Compose Skills" formulates composition as a differentiable mapping from multiple skill-state embeddings to a composed-task embedding, with the resulting policy interpreted as the policy for the composed task; this is explicitly contrasted with classic HRL and options, which usually choose one option at a time in sequence (Sahni et al., 2017). "Skill Is Not Document" argues that skill retrieval differs fundamentally from traditional document retrieval because top-$K$ joint correctness depends not only on independent query-skill relevance but also on whether the retrieved skills can collaborate under the given query, which it formalizes through a query-conditional skill compatibility term $C(q,S_q^*)$ (Wang et al., 2 Jun 2026). "Generative Skill Composition for LLM Agents" formalizes structured skill composition as a joint decision over which skills, how many, and in what order, implemented as task-conditioned skill sequence prediction (Zhao et al., 30 Jun 2026).

This breadth yields several distinct but related optimization problems. In ComposeNet, the central object is a composition function

$C : \langle \iota_e^{(1)}(S), \dots, \iota_e^{(n)}(S) \rangle \to \iota_c(S),$

where skill-state embeddings are combined into a new embedding consumed by a shared policy head (Sahni et al., 2017). In SkillOrchestra, the orchestrator controls a multi-turn process and optimizes a performance-cost trade-off

$\max_\pi J(\tau)=\mathbb{E}_{\tau\sim\pi}\left[R(\tau)-\lambda\sum_{t=0}^{T} C(A_t,z_t)\right],$

with routing factorized into a mode policy and an agent-routing policy (Wang et al., 23 Feb 2026). In SkillComposer, the output is an executable skill plan

$\hat{\mathbf{z}}=(\hat z_1,\hat z_2,\ldots,\hat z_{\hat n},STOP),$

so subset, count, and order emerge jointly from a single decoding pass (Zhao et al., 30 Jun 2026).
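The SkillOrchestra objective above can be made concrete for a single trajectory. The following is a minimal sketch, assuming the per-step costs $C(A_t,z_t)$ are already available as plain numbers; the function name and all numeric values are hypothetical and chosen only for illustration.

```python
def trajectory_objective(reward, step_costs, lam):
    """Return R(tau) - lam * sum_t C(A_t, z_t) for one sampled trajectory."""
    return reward - lam * sum(step_costs)


# Hypothetical trajectories: one cheap and slightly less accurate,
# one expensive and slightly more accurate.
cheap = trajectory_objective(reward=0.8, step_costs=[0.1, 0.1, 0.2], lam=0.5)
costly = trajectory_objective(reward=0.9, step_costs=[0.5, 0.6, 0.7], lam=0.5)

print(cheap, costly)
```

With a nonzero $\lambda$ the cheaper trajectory can score higher despite its lower raw reward, which is the trade-off the objective is meant to expose.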

Regime	Skill unit	Routing target
ComposeNet	skill-state embedding	composed policy
SkillOrchestra	mode-specific skill set	agent under cost
SMoDP	semantic skill phase	expert subset
SkillComposer	skill identifier	ordered skill plan
W2S	workflow-bearing skill	node/branch/tool path

This suggests that compositional skill routing is best understood not as a single architecture but as a problem class: selecting, combining, sequencing, or recursively reusing skill-bearing units under structural constraints that are task-dependent and often state-dependent.

2. Skill representations and interfaces

The viability of routing depends on how skills are represented. ComposeNet uses a particularly explicit interface: each skill $k$ has its own trunk $f_k$ producing a skill-state embedding

$\iota_e^{(k)}(s)=f_k(s)\in\mathbb{R}^d,$

while all skills share a single final policy layer $\pi$. This creates a common semantic space that the policy head can interpret regardless of whether the embedding came from a primitive skill or from a composition layer (Sahni et al., 2017).

SkillOrchestra inserts an intermediate abstraction layer between tasks and agents. A skill is defined as

$\sigma \triangleq \langle \mathcal{D}, \mathcal{I} \rangle,$

where $\mathcal{D}$ is a natural-language capability description and $\mathcal{I}$ is a set of contextual indicators. These skills are organized in a Skill Handbook graph with mode nodes, skill nodes, and agent-profile nodes, so that routing can reason jointly about mode, skill demand, and agent competence (Wang et al., 23 Feb 2026).

Robotic formulations tend to align skill representations with temporal phases and language semantics. SMoDP segments demonstrations into verb-noun skills, embeds those labels with a frozen text encoder, and trains a lightweight predictor to map multimodal context into the same embedding space. Routing is therefore conditioned not on raw latent statistics alone but on a continuous semantic skill embedding (Deng et al., 22 May 2026). SCE similarly builds a persistent skill base by decomposing demonstrations with state-based rules and grounding segments to reusable skills with a VLM, then uses the resulting skill identity as supervision for runtime routing (Zhang et al., 14 Jun 2026).

LLM-agent work often treats the skill itself as a structured artifact. SkillComposer defines a skill as a tuple of metadata, applicability condition, procedural policy, termination condition, and optional resources; only compact metadata is needed for discovery, while the full procedure is loaded after selection (Zhao et al., 30 Jun 2026). W2S makes this decomposition even more explicit: a skill consists of a routing header, a workflow backbone, node-level operational semantics, and an attachment set. Under this view, routing exists both between skills and within a skill’s internal control-flow graph (Zhang et al., 5 Jun 2026).

These representations differ in modality and granularity, but they share a design invariant: a skill must expose a routable interface whose semantics are stable enough to support reuse under new task combinations.

3. Routing mechanisms and architectural families

One architectural family learns composition as dense differentiable fusion. ComposeNet concatenates two skill-state embeddings and applies a fully connected composition layer, after which the same shared policy head used for primitive skills produces the action distribution. Because the policy layer is agnostic to the origin of the embedding, the output of one composition can be fed into another, yielding recursive composition trees and shallow hierarchies (Sahni et al., 2017).
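The dense-fusion pattern described for ComposeNet (per-skill trunks, a composition layer over a concatenated pair of embeddings, and one shared policy head) can be sketched in a few lines. This is an illustrative sketch only: the layer sizes, the tanh activation, the random untrained weights, and the skill names are all assumptions, not details taken from the paper.

```python
import math
import random

random.seed(0)
D, STATE_DIM, N_ACTIONS = 4, 3, 2  # hypothetical sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def linear(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def tanh_layer(W, x):
    # tanh is an assumed activation, used here only to keep values bounded
    return [math.tanh(v) for v in linear(W, x)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One trunk f_k per primitive skill; one composition layer; one shared head.
trunks = {name: rand_matrix(D, STATE_DIM)
          for name in ("collect_red", "evade_blue", "collect_green")}
W_compose = rand_matrix(D, 2 * D)
W_policy = rand_matrix(N_ACTIONS, D)


def embed(skill, state):
    return tanh_layer(trunks[skill], state)


def compose(e1, e2):
    # Concatenate two D-dim embeddings and map back to D dims.
    return tanh_layer(W_compose, e1 + e2)


def policy(embedding):
    # The shared head does not care where the embedding came from.
    return softmax(linear(W_policy, embedding))


state = [0.2, -0.5, 1.0]
pair = compose(embed("collect_red", state), embed("evade_blue", state))
nested = compose(pair, embed("collect_green", state))  # recursive reuse
print(policy(pair), policy(nested))
```

Because `compose` returns a vector of the same dimension as `embed`, its output can be passed back in as an input, which is what makes the recursive composition trees possible.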

A second family uses explicit discrete or sparse routing. "Routing Networks and the Challenges of Modular and Compositional Computation" treats a library of modules as reusable skills and introduces a router that chooses which module to apply next or whether to terminate. The resulting computation is a path through module space, but joint learning of modules and routes introduces module collapse, non-stationarity, difficult credit assignment, and an exploration–interference trade-off; value-based RL was found to outperform policy-gradient and reparameterization methods in this setting (Rosenbaum et al., 2019). "Block-Operations: Using Modular Routing to Improve Compositional Generalization" addresses a related issue by splitting activations into fixed-size blocks and routing them with a Multiplexer, so that whole blocks can be copied, permuted, or modified with low representational distortion (Dietz et al., 2024).

A third family emphasizes parse-tree structure. "Learning Compositional Structures for Deep Learning: Why Routing-by-agreement is Necessary" interprets capsule networks as implementing an And–Or grammar in which routing coefficients act as probabilities over OR-rules. The key control variable is the entropy of the routing coefficients. Low-entropy routing makes each lower-level capsule effectively choose one parent, approximating a tree; high-entropy routing resembles the distributed sharing of CNNs and degrades sensitivity to changes in compositional structure (Venkatraman et al., 2020).
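The contrast between low- and high-entropy routing can be checked numerically. The sketch below assumes the standard Shannon entropy over one lower-level capsule's routing coefficients; the exact form used in the cited paper is not reproduced here, and the coefficient vectors are made up.

```python
import math


def routing_entropy(coeffs):
    """Shannon entropy (in nats) of one unit's routing coefficients."""
    return -sum(c * math.log(c) for c in coeffs if c > 0)


near_tree = [0.97, 0.01, 0.01, 0.01]  # almost all mass on one parent
diffuse = [0.25, 0.25, 0.25, 0.25]    # mass shared across all parents

print(routing_entropy(near_tree), routing_entropy(diffuse))
```

A one-hot coefficient vector has zero entropy, and a uniform vector over $n$ parents reaches the maximum $\log n$, so the value indicates how close the routing is to a strict tree.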

Robotic MoE systems instantiate routing at the temporal-phase level. SMoDP predicts a semantic skill embedding, projects it to a skill token, replaces the action tokens in the router input with that skill token, and broadcasts the resulting logits to all action tokens in the current chunk, so that routing is chunk-consistent. Top-$k$ gating then activates a sparse expert subset for the full action chunk (Deng et al., 22 May 2026). SCE splits decoder adaptation into an Execution Expert Branch and a Transition Expert Branch: the dominant skill is selected by hard routing, transitions are modeled by a soft mixture, and an adaptive coefficient fuses the two (Zhang et al., 14 Jun 2026).
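Chunk-consistent top-$k$ gating, as described for SMoDP, can be illustrated with a small sketch: one set of router logits is computed from the skill token and the same sparse gate is reused for every action token in the chunk. The function names and the logit values are hypothetical, and the softmax-over-selected-experts normalization is an assumption rather than a detail from the paper.

```python
import math


def top_k_gate(logits, k):
    """Keep the k largest logits, softmax over them, and zero out the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def route_chunk(skill_logits, chunk_len, k):
    """Broadcast one gate, computed from the skill token, to every action token."""
    gate = top_k_gate(skill_logits, k)
    return [list(gate) for _ in range(chunk_len)]


# Four experts, a chunk of three action tokens, two active experts.
gates = route_chunk([2.0, 0.5, 1.0, -1.0], chunk_len=3, k=2)
for row in gates:
    print([round(g, 3) for g in row])
```

Every row is identical, so the same expert subset handles the whole chunk instead of switching token by token.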

These mechanisms differ in whether they are dense or sparse, recursive or stepwise, and explicit or implicit in their structure. What they share is the attempt to convert reusable local competence into globally coherent behavior by making routing itself a learnable object rather than an external rule system.

4. Agentic orchestration, retrieval, and workflow routing

In LLM-agent systems, routing shifts from internal latent modules to explicit libraries of tools and skills. SkillOrchestra frames orchestration as a retrieve–decide–execute loop over a query-conditional Skill Handbook. At each turn, the orchestrator selects an operational mode, infers an active subset of skills, and chooses the agent that maximizes weighted competence minus cost. Because competence is modeled per skill and per agent, routing can vary across turns as task demands evolve (Wang et al., 23 Feb 2026).
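The agent-selection rule (weighted competence minus cost) can be sketched as a simple argmax. This is a minimal illustration under stated assumptions: the skill weights, competence tables, costs, agent names, and the linear scoring form are hypothetical stand-ins, not SkillOrchestra's actual estimator.

```python
def select_agent(skill_weights, competence, cost, lam):
    """Pick the agent with the best weighted competence minus lam * cost."""
    def score(agent):
        expected = sum(w * competence[agent].get(skill, 0.0)
                       for skill, w in skill_weights.items())
        return expected - lam * cost[agent]
    return max(competence, key=score)


# Hypothetical per-turn skill demand and per-agent competence profiles.
skill_weights = {"multi_hop_search": 0.7, "arithmetic": 0.3}
competence = {
    "small_model": {"multi_hop_search": 0.5, "arithmetic": 0.9},
    "large_model": {"multi_hop_search": 0.9, "arithmetic": 0.95},
}
cost = {"small_model": 0.1, "large_model": 1.0}

print(select_agent(skill_weights, competence, cost, lam=0.1))
print(select_agent(skill_weights, competence, cost, lam=0.5))
```

Raising the cost weight flips the choice from the stronger agent to the cheaper one, and changing `skill_weights` between turns would change the choice in the same way.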

Retrieval-oriented work makes the front end itself query-conditional and compatibility-aware. R3-Skill argues that skill retrieval is not document retrieval because skills retrieved together must collaborate under the same query, and it formalizes this with a compatibility factor $C(q,S_q^*)$. Its R3-Embedding + R3-Reranker pipeline uses rejected multi-skill combinations as explicit supervision for what should not be jointly retrieved, with graded listwise reranking improving Set-Compat (Wang et al., 2 Jun 2026). SkillRouter studies routing at a larger scale, over a pool of thousands of skills, and shows that the full implementation body is the decisive signal: removing it causes 29–44 percentage point degradation across retrieval methods, while cross-encoder attention concentrates 91.7% on the body field (Zheng et al., 23 Mar 2026).
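Why set-level compatibility changes the answer relative to independent top-$K$ ranking can be shown with a toy example. The sketch below is not the R3 pipeline: it assumes an additive score (sum of per-skill relevance plus pairwise compatibility) and brute-forces small sets, and every skill name and number is invented for illustration.

```python
from itertools import combinations


def best_skill_set(relevance, compatibility, k):
    """Return the size-k skill set with the best relevance + pairwise compatibility."""
    def score(skills):
        rel = sum(relevance[s] for s in skills)
        compat = sum(compatibility.get(frozenset(pair), 0.0)
                     for pair in combinations(skills, 2))
        return rel + compat
    return max(combinations(sorted(relevance), k), key=score)


# Hypothetical query-conditional scores.
relevance = {"pdf_extract": 0.9, "pdf_ocr": 0.85, "table_to_csv": 0.6}
compatibility = {
    frozenset({"pdf_extract", "pdf_ocr"}): -0.8,      # redundant, conflicting pair
    frozenset({"pdf_extract", "table_to_csv"}): 0.3,  # complementary pair
}

independent_top2 = tuple(sorted(relevance, key=relevance.get, reverse=True)[:2])
joint_top2 = best_skill_set(relevance, compatibility, k=2)
print(independent_top2, joint_top2)
```

The two individually most relevant skills are not the best pair once their incompatibility is counted, which is the failure mode that compatibility-aware retrieval targets.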

SkillComposer replaces ranking with sequence generation. It treats skill composition as task-conditioned autoregressive decoding over skill identifiers, so subset, count, and order are predicted jointly rather than by top-$K$ heuristics (Zhao et al., 30 Jun 2026). This is particularly significant because it moves routing from relevance estimation toward executable plan induction.
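Decoding a skill plan one identifier at a time until a STOP symbol is emitted can be sketched with greedy decoding over a stand-in scorer. The scorer here is a hand-written lookup table, not a trained model, and the skill names, the `max_len` cap, and the greedy strategy are assumptions made for illustration.

```python
STOP = "STOP"


def decode_plan(next_skill_scores, task, max_len=8):
    """Greedily extend the plan until STOP wins or max_len is reached."""
    plan = []
    while len(plan) < max_len:
        scores = next_skill_scores(task, tuple(plan))
        choice = max(scores, key=scores.get)
        if choice == STOP:
            break
        plan.append(choice)
    return plan


def toy_scores(task, prefix):
    # Hypothetical next-skill scores conditioned on the plan prefix.
    table = {
        (): {"fetch_data": 0.7, "plot": 0.2, STOP: 0.1},
        ("fetch_data",): {"clean_data": 0.6, "plot": 0.3, STOP: 0.1},
        ("fetch_data", "clean_data"): {"plot": 0.8, STOP: 0.2},
    }
    return table.get(prefix, {STOP: 1.0})


plan = decode_plan(toy_scores, "chart the sales data")
print(plan)
```

The number of skills is never fixed in advance: it is determined by when STOP becomes the highest-scoring continuation, so subset, count, and order all come out of the same decoding loop.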

W2S extends routing inside the skill itself. By reconstructing a workflow graph, node-local semantics, and attachments from traces, it turns a skill into a routable control-flow object. Branching, looping, verification, approval, rollback, and state-management behaviors are preserved as node semantics and attachments rather than flattened into a text summary, so the runtime can route not only to a skill but through the skill’s internal execution structure (Zhang et al., 5 Jun 2026).
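Routing through a skill's internal control flow can be pictured as walking a small graph whose nodes carry an action and a branch rule. The sketch below is a generic illustration, not W2S's data model: the node names, the `(action, router)` pair layout, and the retry loop are all hypothetical.

```python
def run_workflow(nodes, start, state, max_steps=20):
    """Walk the graph: apply each node's action, then follow its branch rule."""
    path, current = [], start
    for _ in range(max_steps):
        if current is None:
            break
        path.append(current)
        action, router = nodes[current]
        state = action(state)
        current = router(state)  # next node name, or None to finish
    return path, state


# Hypothetical skill with a verify-and-retry loop followed by approval.
nodes = {
    "draft": (lambda s: {**s, "attempts": s["attempts"] + 1},
              lambda s: "verify"),
    "verify": (lambda s: {**s, "ok": s["attempts"] >= 2},
               lambda s: "approve" if s["ok"] else "draft"),
    "approve": (lambda s: {**s, "approved": True},
                lambda s: None),
}

path, final_state = run_workflow(nodes, "draft", {"attempts": 0})
print(path)
```

Keeping the branch rule on the node, rather than in a prose summary of the skill, is what lets a runtime reproduce the loop and the approval step exactly.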

Together these works make clear that agentic compositional skill routing is layered. One layer selects relevant skills from a large library; another chooses the order or timing of their invocation; a third may traverse workflow nodes and attachments inside a selected skill.

5. Transfer, generalization, efficiency, and interpretability

A recurring motivation for compositional routing is that it should improve transfer to unseen combinations. ComposeNet demonstrated this directly on a Pacman-like collect/evade environment. When a composition layer trained on five composed tasks was applied zero-shot to the held-out task “collect blue while evade green,” the zero-shot reward was 0.45, and analogous zero-shot results were reported for “collect red or blue” (0.79), “evade red and green” (episode length 8.28), and “collect red then green” (0.53). In hierarchical reuse, “collect red or green while evade blue” was also evaluated zero-shot and improved quickly with training (Sahni et al., 2017).

SkillOrchestra uses explicit skill modeling to avoid routing collapse and to trade performance against cost. Across ten benchmarks it outperformed SoTA RL-based orchestrators by up to 22.5% with 700x and 300x learning cost reduction compared to Router-R1 and ToolOrchestra, respectively; in FRAMES it reached 84.3% accuracy (Wang et al., 23 Feb 2026). SkillComposer similarly showed that structured sequence prediction over skill IDs is not merely elegant but operationally effective: on GPT-5.2-Codex and Gemini-3-Pro-Preview it raised pass rate by +23.1 and +18.2 percentage points over the no-skill baseline, while matching the gold-skill retrieval upper bound at lower prompt-token cost (Zhao et al., 30 Jun 2026).

Robotic results stress both compositional transfer and parameter efficiency. SMoDP’s full model reached 0.970 on LIBERO-90, versus 0.958 without InterCL, 0.957 without IntraCL, and 0.946 without both, indicating that semantic alignment of both skill embeddings and router logits materially improves routing quality. In few-shot transfer with experts frozen, SMoDP was compared against MoDE+LoRA with 10 demos at similar trainable parameter counts (Deng et al., 22 May 2026). SCE reported Final SR 83.4, NBT 4.3, and AUC 82.5 on LIBERO-Goal, and Final SR 73.4, NBT 0.5, and AUC 70.0 on LIBERO-Long, substantially outperforming task-level MoE baselines (Zhang et al., 14 Jun 2026).

Retrieval-oriented systems expose a different trade-off. R3-Embedding + R3-Reranker reported Hit@1, NDCG@10, and Set-Compat on R3-Skill, reaching Set-Compat 0.3525, with the largest gains appearing precisely on the set-level metric most aligned with multi-skill composition (Wang et al., 2 Jun 2026). SkillRouter’s retrieve-and-rerank pipeline reached 74.0% top-1 routing accuracy while remaining small enough for consumer hardware (Zheng et al., 23 Mar 2026).

System	Reported result	Implication
ComposeNet	zero-shot reward 0.45 on “collect blue while evade green”	operator transfer across unseen combinations
SkillOrchestra	up to 22.5%; 700x and 300x learning cost reduction	explicit skill modeling improves accuracy–cost trade-off
SMoDP	0.970 full vs 0.946 w/o both contrastive losses	semantic routing regularization matters
SCE	Final SR 83.4 on LIBERO-Goal	skill-level reuse improves retention
R3	Set-Compat 0.3525	compatibility-aware retrieval improves joint skill retrieval
SkillRouter	74.0% top-1 routing accuracy	body-aware routing scales to large pools

Interpretability is another repeated benefit. SkillOrchestra exposes explicit skill analyses and per-agent competence profiles; SMoDP visualizes expert-activation heatmaps by skill phase; W2S preserves branch predicates and tool attachments; and capsule routing can be analyzed directly through routing-coefficient entropy (Wang et al., 23 Feb 2026, Deng et al., 22 May 2026, Zhang et al., 5 Jun 2026, Venkatraman et al., 2020).

6. Limitations, robustness, safety, and governance

Compositional skill routing is also a source of fragility. Several systems assume a fixed or slowly changing skill set, known task decomposition, or predefined operator type. ComposeNet assumes a fixed base-skill vocabulary, a known logical form such as while/and/or/then, and shallow hierarchies; SCE relies on state-based segmentation rules and VLM grounding; SkillOrchestra remains sensitive to the quality and granularity of discovered skills (Sahni et al., 2017, Zhang et al., 14 Jun 2026, Wang et al., 23 Feb 2026). A plausible implication is that routing quality is bottlenecked as much by ontology construction and interface design as by the router itself.

Robustness under dynamic skill libraries is a distinct concern. "Neural model robustness for skill routing in large-scale conversational AI systems" studies a commercial assistant in which the skill router ranks hypotheses generated from a shared ontology and current subscriptions. It shows that dynamic hypothesis insertion is the main source of brittleness and that random skill-injection augmentation during training can drastically improve robustness. Without augmentation, the Bi-LSTM + BCE model suffers an accuracy drop on the online-like test; with augmentation, all models show no degradation and often slight improvement on the online-like test, with attention-based + BCE + augmentation achieving +1.67% over the baseline (Li et al., 2021).
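Random skill-injection augmentation can be pictured as adding extra negative hypotheses to each training example so the router sees candidate lists that vary the way they would after new skills are onboarded. The following is a rough sketch under that reading; the record layout, function name, and sampling scheme are assumptions and are not taken from the cited paper.

```python
import random


def inject_random_skills(hypotheses, skill_pool, n_inject, rng):
    """Return a shuffled copy of the hypothesis list with extra negative skills."""
    present = {h["skill"] for h in hypotheses}
    candidates = [s for s in skill_pool if s not in present]
    picked = rng.sample(candidates, min(n_inject, len(candidates)))
    injected = [{"skill": s, "label": 0} for s in picked]
    augmented = hypotheses + injected  # new list; the input is left untouched
    rng.shuffle(augmented)
    return augmented


# Hypothetical training example: one correct skill and one competitor.
rng = random.Random(0)
example = [{"skill": "weather", "label": 1}, {"skill": "music", "label": 0}]
pool = ["weather", "music", "timer", "news", "recipes"]
augmented = inject_random_skills(example, pool, n_inject=2, rng=rng)
print([h["skill"] for h in augmented])
```

The positive label is preserved while the number and position of competing hypotheses change, which is the variation the router needs to tolerate at serving time.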

Safety introduces another dimension: the installed skill set itself can become unsafe even when each skill is individually safe. SkillReact formalizes this as compositional risk in agent skill ecosystems. On 1,520 ClawHub skills, 651 passed individual inspection and formed 211,575 pairs; the static benchmark flagged 22.25% of these as structural candidates, and human-calibrated population-weighted validity was 18.2%, implying about 14K genuine risk memberships in a single registry. Its action-based harness further showed that realization is gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issued the dropper-stage tool call on all 39 direct-prompt trials, Opus-4-7 stopped at the download, and Sonnet-4-6 refused outright (Wang et al., 30 May 2026). The implication is that composition-aware install-time checks and capability isolation are necessary complements to per-skill scanning.
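The pair count quoted above follows directly from the number of skills that passed individual inspection, and the idea of a composition-aware install-time check can be sketched alongside it. The capability labels, the risky-combination list, and the checking rule below are hypothetical illustrations, not SkillReact's benchmark or harness.

```python
import math

# 651 individually passing skills form C(651, 2) unordered pairs.
n_pairs = math.comb(651, 2)
print(n_pairs)


def compositional_risks(new_skill, installed, capabilities, risky_combos):
    """Flag installed skills that complete a risky capability combination
    with new_skill, where neither skill covers the combination alone."""
    flagged = []
    for other in installed:
        joint = capabilities[new_skill] | capabilities[other]
        for combo in risky_combos:
            alone = combo <= capabilities[new_skill] or combo <= capabilities[other]
            if combo <= joint and not alone:
                flagged.append((other, tuple(sorted(combo))))
    return flagged


# Hypothetical capability labels for three skills.
capabilities = {
    "fetcher": {"download"},
    "runner": {"execute"},
    "notes": {"write_file"},
}
risky_combos = [frozenset({"download", "execute"})]

flags = compositional_risks("runner", ["fetcher", "notes"], capabilities, risky_combos)
print(flags)
```

Each skill passes a per-skill check here, yet installing `runner` next to `fetcher` completes a download-then-execute chain, which is the kind of case that only a pairwise check can surface.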

Governance becomes especially acute when skills are updated. "Atomic-Probe Governance for Skill Updates in Compositional Robot Policies" introduces a cross-version swap protocol and finds a dominant-skill effect on a dual-arm peg-in-hole task: one reach ECM achieved 86.7% atomic success rate while every other ECM was at or below 26.7%, and whether that dominant ECM entered a composition shifted success rate by up to +50 percentage points. Off-policy behavioral distance metrics failed to identify that dominant ECM. An atomic-only probe achieved 64.6% oracle match at zero per-decision cost, while a Hybrid Selector reached 75.0% at 45.8% of full-revalidation cost (Qin et al., 29 Apr 2026). This makes explicit that skill routing in deployed systems is not only about first-time composition; it is also about deciding when updated components may safely replace existing ones.

Across these works, the central unresolved issues are consistent: discovering skill granularity automatically, scaling routing over very large or deeply nested skill sets, handling non-stationary agent competence and pricing, preserving coverage while avoiding compatibility failures, and enforcing safety when compositional capability exceeds per-skill inspection. Compositional skill routing therefore remains both a constructive paradigm for modular intelligence and a governance problem whose difficulty grows with the expressiveness of the skill library.