Composition-RL: Modular Reuse in Reinforcement Learning

Updated 4 July 2026

Composition-RL is a family of reinforcement-learning methods that recombine pre-solved components to achieve zero-shot transfer and improved generalization.
It leverages techniques such as value-function composition, neural module routing, and language-grounded task synthesis to address diverse control and safety challenges.
The framework spans applications from stochastic control and safety guarantees to verifiable prompt and environment composition in RL, emphasizing structural reuse.

Composition-RL denotes a family of reinforcement-learning methods in which previously solved components are recombined to solve new problems, satisfy structured specifications, or improve generalization to unseen combinations. In its canonical early form, the problem is posed as follows: a lifelong-learning agent has already solved a library of “basis” tasks and wants to tackle a new task immediately—without any further learning—by re-using its existing value functions (Niekerk et al., 2018). The same label is now used across several related settings, including exact and approximate value-function composition, assume–guarantee synthesis for stochastic control systems, modular policy composition, language-grounded task composition, and the automatic composition of verifiable prompts or environments for RL with verifiable rewards (Adamczyk et al., 2022, Žikelić et al., 2023, Xu et al., 12 Feb 2026).

1. Conceptual scope

A recurring feature of Composition-RL is that the constituent tasks share a common substrate while differing in structured ways. In value-function composition, tasks share the same state and action spaces and differ only in reward functions, often only on absorbing goal states (Niekerk et al., 2018). In stochastic-control formulations, continuous-space subsystems are abstracted into finite MDPs or stochastic games and then synthesized compositionally via assume–guarantee reasoning or graph-structured reach-avoid decompositions (Lavaei et al., 2022, Žikelić et al., 2023). In lifelong and language-grounded settings, the common substrate is a reusable library of neural modules, world value functions, or parsed symbolic operators (Mendez et al., 2022, Cohen et al., 21 Jan 2025). In recent RLVR work, the common substrate is a pool of verifiable prompts or deterministic environments that can be fused into harder training instances (Xu et al., 12 Feb 2026, Xiang et al., 10 Jun 2026).

Setting	Composition object	Representative papers
Absorbing-task transfer	Soft or standard $Q$-functions	(Niekerk et al., 2018, Adamczyk et al., 2022, Adamczyk et al., 2023)
Stochastic control systems	Abstract MDPs, edge policies, subsystem specifications	(Lavaei et al., 2022, Žikelić et al., 2023, Neary et al., 2023)
Lifelong and embodied RL	Neural modules, diagnostic subtask structure, Boolean WVFs	(Mendez et al., 2022, Meer et al., 2020, Cohen et al., 21 Jan 2025)
Compositional generalization	Outcome-level policy optimization over outputs	(Fu et al., 6 May 2026, Li et al., 26 May 2025, Yuan et al., 29 Sep 2025)
RL with verifiable rewards	Composed prompts or recursive environment chains	(Xu et al., 12 Feb 2026, Xiang et al., 10 Jun 2026)

This distribution of uses suggests that Composition-RL is not a single algorithmic template but a research program organized around the same operational goal: exploit compositional structure so that solving a new task requires less exploration, less retraining, or stronger guarantees than monolithic RL.

2. Value-function composition in entropy-regularized and standard RL

The foundational formulation of Composition-RL is given in entropy-regularized RL. Let $\mathcal S$ be the state space, $\mathcal A$ the finite action space, $\rho(s'|s,a)$ the transition dynamics (assumed deterministic), and $\mathcal G\subset \mathcal S$ an absorbing goal set. A task MDP is specified by a reward function $r(s,a)$ that differs across tasks only on the absorbing states $\mathcal G$. With temperature $\tau>0$ and reference policy $\bar\pi_s$, the optimal policy is the Boltzmann distribution

$\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$

and the soft composition rule for a library of optimal entropy-regularized $Q$-functions $\{Q^*_k\}_{k=1}^{K}$ is

$Q^*(s,a)=\tau\log\Big(\sum_{k=1}^{K} w_k\exp(Q^*_k(s,a)/\tau)\Big)$

for non-negative weights $w_k$ summing to $1$ (Niekerk et al., 2018). Under deterministic $\rho$ and an MDP family where $r$ differs only on $\mathcal G$, if the new task’s terminal reward satisfies

$\exp(r(s,a)/\tau)=\sum_{k=1}^{K} w_k\exp(r_k(s,a)/\tau)\quad\text{for } s\in\mathcal G,$

then this operator yields the true optimal $Q^*$. In desirability variables $z=\exp(Q^*/\tau)$, composition is linear: $z=\sum_{k=1}^{K} w_k z_k$.

As $\tau\to 0$, the entropy penalty vanishes and the composition rule reduces to

$Q^*(s,a)=\max_k Q^*_k(s,a),$

independent of $w$ (Niekerk et al., 2018). In this limit, the rule coincides with the “generalised policy improvement” or “max-over-heads” principle and is exactly optimal for new tasks whose terminal rewards are the pointwise max of the basis rewards. The same work reports exact OR-style composition, approximate AND-style composition by averaging $Q^*_1$ and $Q^*_2$, and temporal composition that creates an “always-collect-any-remaining-item” policy in an 11×11 pixel-based grid world with six collectible items.
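The OR-style rules described above can be checked numerically. The following is a minimal sketch, assuming the soft rule is the weighted log-sum-exp implied by linearity in the desirability variables $\exp(Q/\tau)$; the function names and toy values are hypothetical and not taken from Niekerk et al. (2018).

```python
import math

def soft_or(q_values, weights, tau):
    """Weighted log-sum-exp of basis Q-values at a single (s, a) pair.

    Assumed form: Q = tau * log(sum_k w_k * exp(Q_k / tau)).
    """
    active = [(q, w) for q, w in zip(q_values, weights) if w > 0]
    q_top = max(q for q, _ in active)  # shift by the max for numerical stability
    total = sum(w * math.exp((q - q_top) / tau) for q, w in active)
    return q_top + tau * math.log(total)

def hard_or(q_values):
    """Zero-temperature limit: pointwise max over the basis Q-values."""
    return max(q_values)

def boltzmann_policy(q_row, prior, tau):
    """pi(a) proportional to prior(a) * exp(Q(s, a) / tau) for one state."""
    q_top = max(q_row)
    unnorm = [p * math.exp((q - q_top) / tau) for q, p in zip(q_row, prior)]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]

basis_q = [1.0, 3.0]   # toy Q-values of two basis tasks at one (s, a)
weights = [0.5, 0.5]   # hypothetical composition weights
for tau in (1.0, 0.1, 0.01):
    print(tau, soft_or(basis_q, weights, tau), hard_or(basis_q))
```

As the temperature shrinks, the printed soft value approaches the hard max, which is the sense in which the weights stop mattering in the limit.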

Later soft-RL work broadens this picture from exact closed-form composition to correction-based composition. For two entropy-regularized tasks $\mathcal A$ 6 and $\mathcal A$ 7, with reward difference $\mathcal A$ 8, the optimal soft value functions satisfy

$\mathcal A$ 9

where $\rho(s'|s,a)$ 0 is the optimal soft $Q$-function of the auxiliary task with reward $\rho(s'|s,a)$ 2 (Adamczyk et al., 2022). The same framework shows that potential-based reward shaping remains policy-invariant in the entropy-regularized case: $\rho(s'|s,a)$ 3 with

$\rho(s'|s,a)$ 4

For a composite reward $\rho(s'|s,a)$ 5, the exact optimal soft $Q$-function has the form

$\rho(s'|s,a)$ 7

so a naïve composition can be corrected by learning a residual soft value function (Adamczyk et al., 2022).

A more general analysis replaces exact equalities by double-sided bounds. For a broad class of convex-sublinear or concave-superlinear composition functions, the optimal composite value $\rho(s'|s,a)$ 8 is bounded above and below by transformations of primitive $Q$-functions plus auxiliary slack terms; this yields regret bounds for zero-shot policies and motivates hard clipping, soft clipping, and test-time clipping during fine-tuning (Adamczyk et al., 2023). This suggests a three-tier taxonomy inside value-function Composition-RL: exact zero-shot composition under restrictive structural assumptions, exact composition with learned corrections in soft RL, and bounded zero-shot transfer with auxiliary uncertainty control.

3. Compositional synthesis with guarantees in stochastic control and safety

In discrete-time stochastic control systems, Composition-RL is formulated over a network of continuous-space stochastic subsystems $\mathcal G\subset \mathcal S$ 0, with unknown dynamics $\mathcal G\subset \mathcal S$ 1 that are only assumed Lipschitz in $\mathcal G\subset \mathcal S$ 2 and $\mathcal G\subset \mathcal S$ 3 with known constants $\mathcal G\subset \mathcal S$ 4 (Lavaei et al., 2022). Each subsystem is implicitly abstracted by a finite MDP $\mathcal G\subset \mathcal S$ 5 obtained from a uniform quantizer $\mathcal G\subset \mathcal S$ 6. The abstraction error in finite-horizon satisfaction is bounded by

$\mathcal G\subset \mathcal S$ 7

where $\mathcal G\subset \mathcal S$ 8 is the horizon, $\mathcal G\subset \mathcal S$ 9 the Lebesgue measure of $r(s,a)$ 0, $r(s,a)$ 1 the state-grid spacing, and $r(s,a)$ 2 the input-grid spacing. Each abstract MDP is viewed as a two-player stochastic game, synthesized locally by minimax-Q learning under worst-case internal input, with no knowledge of neighbors required. The network-level guarantee is a compositional lower bound: $r(s,a)$ 3 The same framework compiles finite-horizon co-safe LTL formulas into automata-based reward functions and uses potential-based reward shaping to densify sparse signals (Lavaei et al., 2022).

A closely related line uses logical specifications provided in SpectRL. Any SpectRL formula $r(s,a)$ 4 is compiled into a directed acyclic abstract graph $r(s,a)$ 5, where vertices carry “vertex regions” and edges carry “safety regions” (Žikelić et al., 2023). Edge policies are learned jointly with reach-avoid supermartingales (RASMs), and a multiplicative RASM yields a tighter lower bound on reach-avoid probability: $r(s,a)$ 6 The algorithm “Claps” then propagates lower bounds over the DAG via

$r(s,a)$ 7

and, if $r(s,a)$ 8, back-tracks the maximizing path and stitches the edge policies in sequence (Žikelić et al., 2023).
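The propagate-then-backtrack step can be sketched as a dynamic program over the abstract graph. This is an illustrative sketch only, assuming that the lower bound of a path is the product of its per-edge reach-avoid lower bounds and that each vertex keeps the best bound over its incoming edges; `best_certified_path` and the toy graph are hypothetical and not the Claps implementation.

```python
from graphlib import TopologicalSorter

def best_certified_path(edge_bounds, source, target):
    """Propagate lower bounds over a DAG and backtrack the maximizing path.

    edge_bounds maps (u, v) to a lower bound on the probability that the
    edge policy for (u, v) reaches v's region while staying safe.
    """
    preds = {source: set(), target: set()}
    for (u, v) in edge_bounds:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)

    bound = {node: 0.0 for node in preds}
    bound[source] = 1.0
    parent = {}
    # static_order() yields every vertex after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        for u in preds[v]:
            candidate = bound[u] * edge_bounds[(u, v)]
            if candidate > bound[v]:
                bound[v], parent[v] = candidate, u

    if bound[target] == 0.0:
        return 0.0, []          # no certified path reaches the target
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return bound[target], path[::-1]

# Toy abstract graph with hypothetical per-edge lower bounds.
edges = {("s", "a"): 0.9, ("a", "t"): 0.8, ("s", "b"): 0.99, ("b", "t"): 0.7}
p, path = best_certified_path(edges, "s", "t")
print(p, path)
```

The returned vertex sequence identifies which edge policies would be stitched in order; comparing the returned bound against a required threshold is left to the caller.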

Verifiable compositional RL systems based on a high-level parametric MDP (pMDP) make the same decomposition explicit at the subsystem level. A collection of RL subsystems, each with entry conditions $r(s,a)$ 9, exit conditions $\mathcal G$ 0, and horizon $\mathcal G$ 1, is abstracted into a pMDP whose parameters $\mathcal G$ 2 represent subtask success probabilities (Neary et al., 2023). If each subsystem policy $\mathcal G$ 3 satisfies

$\mathcal G$ 4

then any high-level pMDP policy that reaches the abstract goal with probability at least $\mathcal G$ 5 induces a composed policy in the original environment that satisfies the overall task with at least the same probability. When empirical subsystem performance falls below specification, the framework re-solves a bilinear program to update the subtask thresholds and re-route the high-level policy (Neary et al., 2023).

Safety-aware task composition adds a distinct Boolean layer. In a deterministic labeled MDP, conjunction and disjunction of extended optimal $\mathcal G$ 6-functions are implemented by pointwise $\mathcal G$ 7 and $\mathcal G$ 8, while analytical negation for minimum-violation semantics is

$\mathcal G$ 9

The framework distinguishes minimum-violation paths from prioritized-safety paths and extends Boolean composition from discrete action spaces to continuous action spaces via TD3, with actor selection determined by the composed critics (Leahy et al., 2023). This makes explicit a trade-off already implicit in earlier composition rules: exact recombination is often easiest for reachability or OR-type objectives, while safety and avoidance introduce extra semantics, extra assumptions, or approximate reasoning.

4. Neural modules, language grounding, and embodied policy composition

A modular lifelong formulation assumes a fixed set of shared subproblem solution spaces $\tau>0$ 0 and a corresponding set of neural modules $\tau>0$ 1, such that the optimal policy for task $\tau>0$ 2 can be written as a composition

$\tau>0$ 3

Modules are organized into ordered depths, one module is selected per depth, online exploration is performed with PPO on a module copy, and offline consolidation uses Batch-Constrained Q-learning over replay buffers from all tasks seen so far (Mendez et al., 2022). The reported settings include 64 tasks in a discrete 2-D grid domain and 48 tasks in robotic manipulation. The method exhibits zero-shot generalization when structure is given, forward transfer during sequential learning, and retention via offline replay. A plausible implication is that Composition-RL in lifelong settings is as much about routing and reuse as about closed-form value algebra.
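The one-module-per-depth structure can be illustrated with plain functions standing in for neural modules. This is a minimal sketch under that simplifying assumption; `compose`, `MODULES`, and the toy arithmetic modules are hypothetical and do not reflect the architecture of Mendez et al. (2022).

```python
from itertools import product

def compose(modules_by_depth, selection):
    """Chain one module per depth: the output of depth d feeds depth d + 1."""
    chosen = [modules_by_depth[d][i] for d, i in enumerate(selection)]

    def policy(x):
        for module in chosen:
            x = module(x)
        return x

    return policy

# Two depths with two interchangeable modules each (toy stand-ins).
MODULES = [
    [lambda x: x + 1, lambda x: x * 2],   # depth 0
    [lambda h: h - 3, lambda h: -h],      # depth 1
]

# Four modules yield four distinct composed policies.
for selection in product(range(2), range(2)):
    print(selection, compose(MODULES, selection)(5))
```

The point of the sketch is combinatorial: a task whose module selection was never trained jointly can still be instantiated immediately, provided the correct selection is known.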

Language-conditioned Composition-RL makes the compositional structure explicit in either latent state or symbolic parses. In BabyAI-style instruction following, a diagnostic classifier is attached to the agent’s LSTM hidden state $\tau>0$ 4 to predict which subtask is currently active, with combined objective

$\tau>0$ 5

The classifier’s gradient shapes the hidden states so that they encode “current objective,” yielding more interpretable clustering and improved performance on repeated-visit and zero-shot transfer settings (Meer et al., 2020). The effect is not uniform: the paper reports negligible effect on the simplest levels, but a reduction in episode length on “Before-repeat” and improved zero-shot success on novel attributes.
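One way to read the combined objective is as the RL loss plus a weighted auxiliary classification loss on the hidden state. The sketch below assumes a cross-entropy diagnostic loss and a scalar weight; `aux_weight` and the function names are hypothetical, and the exact weighting used by Meer et al. (2020) is not reproduced here.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    top = max(logits)  # shift by the max for numerical stability
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[label]

def combined_loss(rl_loss, subtask_logits, active_subtask, aux_weight=0.1):
    """RL loss plus a weighted diagnostic loss (aux_weight is an assumption)."""
    return rl_loss + aux_weight * cross_entropy(subtask_logits, active_subtask)

# Toy example: classifier logits over three subtasks, subtask 2 is active.
print(combined_loss(rl_loss=1.5, subtask_logits=[0.2, -1.0, 2.3], active_subtask=2))
```

Because the classifier reads the recurrent hidden state, its gradient flows back into the state representation, which is the mechanism the text credits for the more interpretable clustering.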

CERLLA pushes this symbolic direction much further. It pretrains $\tau>0$ 6 world value functions, one per atomic attribute, and composes them through Boolean operators

$\tau>0$ 7

with negation

$\tau>0$ 8

A language instruction is mapped by a semantic parser to a Boolean expression over the symbols, and the resulting composed $\tau>0$ 9-function is used greedily (Cohen et al., 21 Jan 2025). The parser itself is improved by RL-style feedback: BM25 retrieves up to $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 0 previous examples, an LLM proposes a beam of $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 1 candidate expressions, and a candidate parse is accepted when the composed policy reaches a success-rate equal to the oracle upper-bound performance of $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 2. On 162 BabyAI tasks, the method reaches a success rate equal to the oracle policy’s upper-bound performance of $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 3, while the non-compositional baseline reaches $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 4 with the same number of environment steps (Cohen et al., 21 Jan 2025).
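The parse-then-compose pipeline can be sketched with small lookup tables. The code below assumes the operator forms commonly used in Boolean task algebras: pointwise min for conjunction, pointwise max for disjunction, and $(Q_{\max}+Q_{\min})-Q$ for negation. It uses plain Q-tables rather than goal-conditioned world value functions, and every name and value is hypothetical.

```python
def q_and(q1, q2):
    return {k: min(q1[k], q2[k]) for k in q1}

def q_or(q1, q2):
    return {k: max(q1[k], q2[k]) for k in q1}

def q_not(q, q_max, q_min):
    return {k: q_max[k] + q_min[k] - q[k] for k in q}

def evaluate(expr, symbols, q_max, q_min):
    """Evaluate a parsed expression such as ("and", "red", ("not", "ball"))."""
    if isinstance(expr, str):
        return symbols[expr]
    op, *args = expr
    vals = [evaluate(a, symbols, q_max, q_min) for a in args]
    if op == "and":
        return q_and(*vals)
    if op == "or":
        return q_or(*vals)
    if op == "not":
        return q_not(vals[0], q_max, q_min)
    raise ValueError(f"unknown operator: {op}")

def greedy_action(q, state, actions):
    return max(actions, key=lambda a: q[(state, a)])

ACTIONS = ["left", "right"]
KEYS = [("s0", a) for a in ACTIONS]
SYMBOLS = {
    "red":  {("s0", "left"): 0.9, ("s0", "right"): 0.6},
    "ball": {("s0", "left"): 0.8, ("s0", "right"): 0.1},
}
Q_MAX = {k: 1.0 for k in KEYS}   # assumed upper envelope of the value functions
Q_MIN = {k: 0.0 for k in KEYS}   # assumed lower envelope

composed = evaluate(("and", "red", ("not", "ball")), SYMBOLS, Q_MAX, Q_MIN)
print(greedy_action(composed, "s0", ACTIONS))
```

In this toy setup the composed table prefers the action that is good for "red" while being bad for "ball", which is the behaviour a parse of "red and not ball" should induce.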

Concept learning provides another embodied route to compositionality. In a 3D Unity environment where instructions specify color-shape targets, A2C agents trained directly on color-and-shape combinations require about $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 5K episodes on train combinations and about $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 6K episodes on held-out combinations. When agents are first trained on color-only or shape-only instructions and then fine-tuned on color-and-shape combinations, train learning drops to $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 7K episodes and held-out learning to $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 8K episodes; only these concept-then-compose agents solve a more complex zero-shot $\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$ 9 environment (Lin et al., 2023). This suggests that Composition-RL can emerge either from explicit operators over learned primitives or from representational factorization that makes later recombination much easier.

5. Outcome-level RL and compositional generalization in generative models

A distinct contemporary meaning of Composition-RL concerns models that must interpret or generate unseen combinations of known primitives. One formulation defines a compositional type $Q$ 0, where $Q$ 1 is an ordered tuple of primitives and $Q$ 2 is a composition rule. Instead of token-level cross-entropy, the policy $Q$ 3 is optimized at the outcome level with Group Relative Policy Optimization (GRPO). Two rewards are studied: a binary exact-match reward and a composite reward built from primitive coverage and a compositional-skeleton term (Fu et al., 6 May 2026). On SCAN, COGS, GeoQuery, and CFQ, GRPO improves compositional generalization over supervised fine-tuning. The exact-match averages over three runs reported for SFT, GRPO-Bin, and GRPO-Comp are $Q$ 4, $Q$ 5, and $Q$ 6, with especially large gains on SCAN-length and CFQ-MCD3 (Fu et al., 6 May 2026). The paper further reports that supervised models exhibit higher mean training-data trigram frequency among incorrect predictions, whereas GRPO reduces this copying bias and sharpens the output distribution.
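Outcome-level optimization with a binary exact-match reward can be sketched as follows. This assumes the standard group-normalized advantage (reward minus group mean, divided by group standard deviation); the population standard deviation, the `eps` constant, and the toy strings are assumptions rather than details from Fu et al.

```python
import statistics

def exact_match_reward(prediction, target):
    """Binary outcome reward: 1.0 only if the whole output matches."""
    return 1.0 if prediction.strip() == target.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

target = "WALK WALK JUMP"
samples = ["WALK WALK JUMP", "WALK JUMP", "JUMP WALK WALK", "WALK WALK JUMP"]
rewards = [exact_match_reward(s, target) for s in samples]
print(rewards, group_relative_advantages(rewards))
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no learning signal.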

Vision-language reasoning exposes a harder version of the same problem. ComPABench trains on isolated skills such as Shape Area and Grid Position, then evaluates cross-modal, cross-task, and out-of-distribution compositions (Li et al., 26 May 2025). For the 7B model, pure-text SFT attains near-perfect performance on the component tasks but drops to $Q$ 7 on pure-text compositional evaluation; pure-text RL reaches $Q$ 8. In multimodal composition, SFT attains $Q$ 9 and RL $\mathcal S$ 00. The paper argues that current VLMs trained with RL or other post-training strategies still struggle compositionally under cross-modal and cross-task scenarios, and proposes RL-Ground, which combines “caption-before-thinking” with progressive vision-to-text grounding rewards. On multimodal composition, the 7B RL-Ground variant reaches $\mathcal S$ 01, compared with $\mathcal S$ 02 for baseline RL (Li et al., 26 May 2025).

A synthetic string-transformation study addresses a central controversy directly: whether RL teaches genuinely new skills or merely activates existing ones. Atomic skills are deterministic functions $\mathcal S$ 03, $\mathcal S$ 04, and the compositional target is $\mathcal S$ 05. Stage 1 trains atomic skills; Stage 2 performs RL on composed tasks with a terminal binary reward (Yuan et al., 29 Sep 2025). RL on atomic data only succeeds on Level 1 but fails to generalize compositionally, while RL on Level 2 compositions generalizes to deeper unseen levels and transfers to a different target task, Countdown, when the model already has the target’s atomic skills. The same paper reports that next-token training on the same compositional data does not produce these effects. This provides explicit evidence for a compositional-skill acquisition account of RL in at least one controlled setting (Yuan et al., 29 Sep 2025).
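The task structure of such a study can be sketched with a few deterministic string functions. The atomic skills below (`reverse`, `rotate_left`, `swap_case`) are hypothetical stand-ins, not the functions used by Yuan et al.; the sketch only shows how a depth-2 target and a terminal binary reward fit together.

```python
ATOMIC = {
    "reverse": lambda s: s[::-1],
    "rotate_left": lambda s: s[1:] + s[:1],
    "swap_case": lambda s: s.swapcase(),
}

def compose_skills(names):
    """Build f(g(x)) from names [f, g]: the innermost (last) skill runs first."""
    def composed(x):
        for name in reversed(names):
            x = ATOMIC[name](x)
        return x
    return composed

def terminal_reward(answer, x, names):
    """Binary reward granted only on the final answer."""
    return 1.0 if answer == compose_skills(names)(x) else 0.0

level2 = ["reverse", "rotate_left"]          # reverse(rotate_left(x))
print(compose_skills(level2)("abcd"))
print(terminal_reward("adcb", "abcd", level2))
```

Deeper levels are obtained by lengthening the list of names, so held-out depths can be generated without defining new atomic skills.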

6. Verifiable prompt and environment composition for RL with verifiable rewards

In RL with verifiable rewards, Composition-RL is used to create new training tasks automatically. One proposal targets the pass-rate-1 problem in RLVR: policy-gradient updates vanish when a prompt’s sampled rollouts are all correct or all incorrect, and easy prompts become increasingly prevalent as training proceeds (Xu et al., 12 Feb 2026). Sequential Prompt Composition (SPC) defines a recursive operator that combines multiple original prompts into a new verifiable question of compositional depth $\mathcal S$ 06. Training uses GRPO over the surrogate compositional dataset $\mathcal S$ 07, optionally with a curriculum $\mathcal S$ 08, then $\mathcal S$ 09, then $\mathcal S$ 10. On Qwen3-4B, the overall average rises from $\mathcal S$ 11 for baseline RL on original prompts to $\mathcal S$ 12 for depth-2 Composition-RL; the curriculum variant reaches $\mathcal S$ 13. In cross-domain training, Physics-Math-Composition reaches $\mathcal S$ 14, compared with $\mathcal S$ 15 for mix training and $\mathcal S$ 16 for sequential Math-then-Physics (Xu et al., 12 Feb 2026).

A related but more general construction treats verifiable environments themselves as composable objects. In RACES, each environment is a four-tuple $\mathcal S$ 17, and two environments are composable when the codomain of one matches the domain of the next (Xiang et al., 10 Jun 2026). Recursive composition yields chains $\mathcal S$ 18, and four operators instantiate RL tasks from these chains: $\mathcal S$ 19, $\mathcal S$ 20, $\mathcal S$ 21, and $\mathcal S$ 22. Path discovery uses a frontier-based BFS with quality filters on runtime errors, timeouts $\mathcal S$ 23 s, excessive steps $\mathcal S$ 24, and degenerate outputs. Across six unseen benchmarks, RACES improves DeepSeek-R1-Distill-Qwen-14B from $\mathcal S$ 25 to $\mathcal S$ 26 and Qwen3-14B from $\mathcal S$ 27 to $\mathcal S$ 28. On Qwen3-4B-Instruct-2507, RL on 50 base environments composed by RACES reaches $\mathcal S$ 29, exceeding $\mathcal S$ 30 from RL on 300 individual environments (Xiang et al., 10 Jun 2026).
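The composability condition and breadth-first chain discovery can be sketched as below. The `Env` fields (`name`, `domain`, `codomain`, `fn`) are a hypothetical simplification of the four-tuple, type tags are plain strings, and the quality filters on runtime errors, timeouts, step counts, and degenerate outputs are omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Env:
    name: str
    domain: str      # hypothetical type tag of the input
    codomain: str    # hypothetical type tag of the output
    fn: Callable

def composable(first, second):
    """Two environments chain when the output type matches the next input type."""
    return first.codomain == second.domain

def discover_chains(envs, max_depth):
    """Frontier-based BFS over chains of distinct, type-compatible environments."""
    frontier = deque([e] for e in envs)
    chains = []
    while frontier:
        chain = frontier.popleft()
        if len(chain) > 1:
            chains.append(chain)
        if len(chain) < max_depth:
            for nxt in envs:
                if nxt not in chain and composable(chain[-1], nxt):
                    frontier.append(chain + [nxt])
    return chains

def run_chain(chain, x):
    """The verifiable answer of a chain is the output of its last environment."""
    for env in chain:
        x = env.fn(x)
    return x

ENVS = [
    Env("sort", "list", "list", sorted),
    Env("total", "list", "int", sum),
    Env("square", "int", "int", lambda n: n * n),
]
CHAINS = discover_chains(ENVS, max_depth=3)
for chain in CHAINS:
    print([e.name for e in chain])
print(run_chain(CHAINS[-1], [3, 1, 2]))
```

Because every environment is deterministic, the composed chain stays exactly verifiable: the reference answer is obtained by executing the chain on the sampled input.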

These RLVR formulations invert the direction of classical Composition-RL. Instead of composing policies or value functions to solve a new downstream task, they compose tasks themselves to generate a richer on-policy curriculum. The common mechanism remains structural reuse: existing solved or verifiable units are not discarded after training but reassembled into more difficult instances whose reward remains exact.

Across these strands, several limitations recur. Exact value-function composition typically requires tasks that differ only in terminal-state rewards and, in the 2018 recipe, deterministic dynamics; AND-type composition is only approximate there (Niekerk et al., 2018). Soft correction and bound-based methods often assume shared state and action spaces and remain primarily tabular in their formal analyses (Adamczyk et al., 2022, Adamczyk et al., 2023). Formal-control approaches depend on abstraction error bounds, verification machinery, or graph decompositions (Lavaei et al., 2022, Žikelić et al., 2023). RLVR composition depends on reliable verifiers and incurs longer prompts or deeper environment chains as compositional depth increases (Xu et al., 12 Feb 2026, Xiang et al., 10 Jun 2026). Even with these constraints, the literature consistently treats composition not as a peripheral convenience but as the central mechanism for zero-shot transfer, safety-preserving synthesis, modular reuse, and compositional generalization.