
Composition-RL: Modular Reuse in Reinforcement Learning

Updated 4 July 2026
  • Composition-RL is a family of reinforcement-learning methods that recombine pre-solved components to achieve zero-shot transfer and improved generalization.
  • It leverages techniques such as value-function composition, neural module routing, and language-grounded task synthesis to address diverse control and safety challenges.
  • The framework spans applications from stochastic control and safety guarantees to verifiable prompt and environment composition in RL, emphasizing structural reuse.

Composition-RL denotes a family of reinforcement-learning methods in which previously solved components are recombined to solve new problems, satisfy structured specifications, or improve generalization to unseen combinations. In its canonical early form, the problem is posed as follows: a lifelong-learning agent has already solved a library of “basis” tasks and wants to tackle a new task immediately—without any further learning—by re-using its existing value functions (Niekerk et al., 2018). The same label is now used across several related settings, including exact and approximate value-function composition, assume–guarantee synthesis for stochastic control systems, modular policy composition, language-grounded task composition, and the automatic composition of verifiable prompts or environments for RL with verifiable rewards (Adamczyk et al., 2022, Žikelić et al., 2023, Xu et al., 12 Feb 2026).

1. Conceptual scope

A recurring feature of Composition-RL is that the constituent tasks share a common substrate while differing in structured ways. In value-function composition, tasks share the same state and action spaces and differ only in reward functions, often only on absorbing goal states (Niekerk et al., 2018). In stochastic-control formulations, continuous-space subsystems are abstracted into finite MDPs or stochastic games and then synthesized compositionally via assume–guarantee reasoning or graph-structured reach-avoid decompositions (Lavaei et al., 2022, Žikelić et al., 2023). In lifelong and language-grounded settings, the common substrate is a reusable library of neural modules, world value functions, or parsed symbolic operators (Mendez et al., 2022, Cohen et al., 21 Jan 2025). In recent RLVR work, the common substrate is a pool of verifiable prompts or deterministic environments that can be fused into harder training instances (Xu et al., 12 Feb 2026, Xiang et al., 10 Jun 2026).

| Setting | Composition object | Representative papers |
| --- | --- | --- |
| Absorbing-task transfer | Soft or standard $Q$-functions | (Niekerk et al., 2018, Adamczyk et al., 2022, Adamczyk et al., 2023) |
| Stochastic control systems | Abstract MDPs, edge policies, subsystem specifications | (Lavaei et al., 2022, Žikelić et al., 2023, Neary et al., 2023) |
| Lifelong and embodied RL | Neural modules, diagnostic subtask structure, Boolean WVFs | (Mendez et al., 2022, Meer et al., 2020, Cohen et al., 21 Jan 2025) |
| Compositional generalization | Outcome-level policy optimization over outputs | (Fu et al., 6 May 2026, Li et al., 26 May 2025, Yuan et al., 29 Sep 2025) |
| RL with verifiable rewards | Composed prompts or recursive environment chains | (Xu et al., 12 Feb 2026, Xiang et al., 10 Jun 2026) |

This distribution of uses suggests that Composition-RL is not a single algorithmic template but a research program organized around the same operational goal: exploit compositional structure so that solving a new task requires less exploration, less retraining, or stronger guarantees than monolithic RL.

2. Value-function composition in entropy-regularized and standard RL

The foundational formulation of Composition-RL is given in entropy-regularized RL. Let $\mathcal S$ be the state space, $\mathcal A$ the finite action space, $\rho(s'|s,a)$ the deterministic transition function, and $\mathcal G\subset \mathcal S$ an absorbing goal set. A task MDP is specified by a reward function $r(s,a)$ that differs across tasks only on the absorbing states $\mathcal G$. With temperature $\tau>0$, the optimal policy is the Boltzmann distribution

$$\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),$$

and the soft composition rule for a library of optimal entropy-regularized $Q$-functions $\{Q_k^*\}$ is

$$Q^*(s,a)=\tau\log\sum_k w_k\exp\big(Q_k^*(s,a)/\tau\big)$$

for non-negative weights $w_k$ summing to $1$ (Niekerk et al., 2018). Under deterministic $\rho$ and an MDP family where $r$ differs only on $\mathcal G$, if the new task’s terminal reward satisfies

$$r(s,a)=\tau\log\sum_k w_k\exp\big(r_k(s,a)/\tau\big),\qquad s\in\mathcal G,$$

then this operator yields the true optimal $Q^*$. In desirability variables $z=\exp(Q/\tau)$, composition is linear: $z^*=\sum_k w_k z_k^*$.

As $\tau\to 0$, the entropy penalty vanishes and the composition rule reduces to

$$Q^*(s,a)=\max_k Q_k^*(s,a),$$

independent of the weights $w$ (Niekerk et al., 2018). In this limit, the rule coincides with the “generalised policy improvement” or “max-over-heads” principle and is exactly optimal for new tasks whose terminal rewards are the pointwise max of the basis rewards. The same work reports exact OR-style composition, approximate AND-style composition by averaging the constituent $Q$-functions, and temporal composition that creates an “always-collect-any-remaining-item” policy in an 11×11 pixel-based grid world with six collectible items.
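The soft OR rule and its low-temperature limit can be illustrated numerically. The sketch below assumes the rule takes the weighted log-sum-exp form, so that it is linear in the desirability $z=\exp(Q/\tau)$; the basis values and weights are made-up numbers for a single state-action pair, not results from the paper.

```python
import math

def soft_or(q_values, weights, tau):
    """Weighted log-sum-exp of basis Q-values at one (state, action) pair."""
    m = max(q_values)  # subtract the max for numerical stability
    total = sum(w * math.exp((q - m) / tau) for q, w in zip(q_values, weights))
    return m + tau * math.log(total)

# Hypothetical basis values Q_k(s, a) for three basis tasks.
basis_q = [1.0, 3.0, 2.0]
weights = [0.2, 0.5, 0.3]  # non-negative, summing to 1

warm = soft_or(basis_q, weights, tau=1.0)    # entropy-regularized composition
cold = soft_or(basis_q, weights, tau=1e-3)   # approaches max(basis_q) as tau -> 0

# Linearity check in desirability variables z = exp(Q / tau), here with tau = 1.
z_composed = math.exp(warm)
z_linear = sum(w * math.exp(q) for q, w in zip(basis_q, weights))
```

At `tau=1.0` the composed value sits below the largest basis value, and `z_composed` matches the weighted sum of basis desirabilities; at `tau=1e-3` the result is numerically indistinguishable from the pointwise maximum, whatever the weights.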

Later soft-RL work broadens this picture from exact closed-form composition to correction-based composition. For two entropy-regularized tasks A\mathcal A6 and A\mathcal A7, with reward difference A\mathcal A8, the optimal soft value functions satisfy

A\mathcal A9

where ρ(ss,a)\rho(s'|s,a)0 is the optimal soft ρ(ss,a)\rho(s'|s,a)1-function of the auxiliary task with reward ρ(ss,a)\rho(s'|s,a)2 (Adamczyk et al., 2022). The same framework shows that potential-based reward shaping remains policy-invariant in the entropy-regularized case: ρ(ss,a)\rho(s'|s,a)3 with

ρ(ss,a)\rho(s'|s,a)4

For a composite reward ρ(ss,a)\rho(s'|s,a)5, the exact optimal soft ρ(ss,a)\rho(s'|s,a)6-function has the form

ρ(ss,a)\rho(s'|s,a)7

so a naïve composition can be corrected by learning a residual soft value function (Adamczyk et al., 2022).
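The correction idea can be checked on a small tabular problem. The sketch below is not the authors' code: it assumes discounted soft Q-iteration with a uniform prior, a randomly generated MDP, and an auxiliary task whose reward is the reward difference and whose prior policy is the optimal policy of the already solved task. Under those assumptions, the new task's soft $Q$-function equals the old one plus the auxiliary one.

```python
import math
import random

def soft_q_iteration(P, r, prior, tau=1.0, gamma=0.9, iters=400):
    """Tabular soft Q-iteration with prior policy `prior`.

    P[s][a][t] is the probability of moving from s to t under action a,
    r[s][a] the reward, prior[s][a] the prior action probability.
    Returns the soft Q-table, the soft values, and the optimal policy.
    """
    n_s, n_a = len(r), len(r[0])

    def soft_value(Q):
        return [tau * math.log(sum(prior[s][a] * math.exp(Q[s][a] / tau)
                                   for a in range(n_a)))
                for s in range(n_s)]

    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = soft_value(Q)
        Q = [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    V = soft_value(Q)
    pi = [[prior[s][a] * math.exp((Q[s][a] - V[s]) / tau) for a in range(n_a)]
          for s in range(n_s)]
    return Q, V, pi

def random_dist(n):
    w = [random.random() + 0.1 for _ in range(n)]
    return [x / sum(w) for x in w]

random.seed(0)
n_s, n_a = 4, 3  # hypothetical toy sizes
P = [[random_dist(n_s) for _ in range(n_a)] for _ in range(n_s)]
r_old = [[random.random() for _ in range(n_a)] for _ in range(n_s)]
r_new = [[random.random() for _ in range(n_a)] for _ in range(n_s)]
uniform = [[1.0 / n_a] * n_a for _ in range(n_s)]

Q_old, _, pi_old = soft_q_iteration(P, r_old, uniform)
Q_new, _, _ = soft_q_iteration(P, r_new, uniform)

# Auxiliary task: reward difference, with the old optimal policy as prior.
kappa = [[r_new[s][a] - r_old[s][a] for a in range(n_a)] for s in range(n_s)]
K, _, _ = soft_q_iteration(P, kappa, pi_old)

max_gap = max(abs(Q_new[s][a] - (Q_old[s][a] + K[s][a]))
              for s in range(n_s) for a in range(n_a))
```

Here `max_gap` stays at the level of floating-point error, which is the sense in which a previously solved task plus a learned correction recovers the new task exactly.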

A more general analysis replaces exact equalities by double-sided bounds. For a broad class of convex-sublinear or concave-superlinear composition functions, the optimal composite value ρ(ss,a)\rho(s'|s,a)8 is bounded above and below by transformations of primitive ρ(ss,a)\rho(s'|s,a)9-functions plus auxiliary slack terms; this yields regret bounds for zero-shot policies and motivates hard clipping, soft clipping, and test-time clipping during fine-tuning (Adamczyk et al., 2023). This suggests a three-tier taxonomy inside value-function Composition-RL: exact zero-shot composition under restrictive structural assumptions, exact composition with learned corrections in soft RL, and bounded zero-shot transfer with auxiliary uncertainty control.

3. Compositional synthesis with guarantees in stochastic control and safety

In discrete-time stochastic control systems, Composition-RL is formulated over a network of continuous-space stochastic subsystems GS\mathcal G\subset \mathcal S0, with unknown dynamics GS\mathcal G\subset \mathcal S1 that are only assumed Lipschitz in GS\mathcal G\subset \mathcal S2 and GS\mathcal G\subset \mathcal S3 with known constants GS\mathcal G\subset \mathcal S4 (Lavaei et al., 2022). Each subsystem is implicitly abstracted by a finite MDP GS\mathcal G\subset \mathcal S5 obtained from a uniform quantizer GS\mathcal G\subset \mathcal S6. The abstraction error in finite-horizon satisfaction is bounded by

GS\mathcal G\subset \mathcal S7

where GS\mathcal G\subset \mathcal S8 is the horizon, GS\mathcal G\subset \mathcal S9 the Lebesgue measure of r(s,a)r(s,a)0, r(s,a)r(s,a)1 the state-grid spacing, and r(s,a)r(s,a)2 the input-grid spacing. Each abstract MDP is viewed as a two-player stochastic game, synthesized locally by minimax-Q learning under worst-case internal input, with no knowledge of neighbors required. The network-level guarantee is a compositional lower bound: r(s,a)r(s,a)3 The same framework compiles finite-horizon co-safe LTL formulas into automata-based reward functions and uses potential-based reward shaping to densify sparse signals (Lavaei et al., 2022).

A closely related line uses logical specifications provided in SpectRL. Any SpectRL formula r(s,a)r(s,a)4 is compiled into a directed acyclic abstract graph r(s,a)r(s,a)5, where vertices carry “vertex regions” and edges carry “safety regions” (Žikelić et al., 2023). Edge policies are learned jointly with reach-avoid supermartingales (RASMs), and a multiplicative RASM yields a tighter lower bound on reach-avoid probability: r(s,a)r(s,a)6 The algorithm “Claps” then propagates lower bounds over the DAG via

r(s,a)r(s,a)7

and, if r(s,a)r(s,a)8, back-tracks the maximizing path and stitches the edge policies in sequence (Žikelić et al., 2023).
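The propagate-then-backtrack step can be sketched as a best-path computation over the abstract graph. The version below is only an illustration: it assumes that the bound of a path is the product of its per-edge lower bounds and that a vertex takes the best bound over incoming edges. The graph, the edge bounds, and the function name `propagate_bounds` are hypothetical.

```python
def propagate_bounds(order, edge_bound, source, target, threshold):
    """Propagate probability lower bounds over a DAG and recover the best path.

    order      -- vertices in topological order
    edge_bound -- dict mapping (u, v) to a lower bound for that edge policy
    Returns (bound at target, path or None if the threshold is not met).
    """
    best = {v: 0.0 for v in order}
    parent = {}
    best[source] = 1.0
    for v in order:  # all predecessors of v are final by the time v is visited
        for (u, w), p in edge_bound.items():
            if w == v and best[u] * p > best[v]:
                best[v] = best[u] * p
                parent[v] = u
    if best[target] < threshold or best[target] == 0.0:
        return best[target], None
    path = [target]
    while path[-1] != source:  # walk the maximizing predecessors back
        path.append(parent[path[-1]])
    return best[target], path[::-1]

# Hypothetical abstract graph with per-edge reach-avoid lower bounds.
order = ["s", "a", "b", "t"]
edges = {("s", "a"): 0.9, ("a", "t"): 0.8,
         ("s", "b"): 0.95, ("b", "t"): 0.7,
         ("s", "t"): 0.6}

bound, path = propagate_bounds(order, edges, "s", "t", threshold=0.7)
strict_bound, strict_path = propagate_bounds(order, edges, "s", "t", threshold=0.8)
```

With these numbers the route through `a` wins with a bound of about 0.72; the returned path is the order in which the corresponding edge policies would be executed. Raising the threshold to 0.8 returns no path, which corresponds to the case where the specification cannot be certified at the requested level.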

Verifiable compositional RL systems based on a high-level parametric MDP (pMDP) make the same decomposition explicit at the subsystem level. A collection of RL subsystems, each with entry conditions r(s,a)r(s,a)9, exit conditions G\mathcal G0, and horizon G\mathcal G1, is abstracted into a pMDP whose parameters G\mathcal G2 represent subtask success probabilities (Neary et al., 2023). If each subsystem policy G\mathcal G3 satisfies

G\mathcal G4

then any high-level pMDP policy that reaches the abstract goal with probability at least G\mathcal G5 induces a composed policy in the original environment that satisfies the overall task with at least the same probability. When empirical subsystem performance falls below specification, the framework re-solves a bilinear program to update the subtask thresholds and re-route the high-level policy (Neary et al., 2023).

Safety-aware task composition adds a distinct Boolean layer. In a deterministic labeled MDP, conjunction and disjunction of extended optimal $Q$-functions are implemented by pointwise $\min$ and $\max$, respectively, while analytical negation for minimum-violation semantics is

G\mathcal G9

The framework distinguishes minimum-violation paths from prioritized-safety paths and extends Boolean composition from discrete action spaces to continuous action spaces via TD3, with actor selection determined by the composed critics (Leahy et al., 2023). This makes explicit a trade-off already implicit in earlier composition rules: exact recombination is often easiest for reachability or OR-type objectives, while safety and avoidance introduce extra semantics, extra assumptions, or approximate reasoning.
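One way to read "actor selection determined by the composed critics" is sketched below. It assumes that conjunction is a pointwise minimum and disjunction a pointwise maximum over critics, that each primitive task contributes one actor, and that the composed policy picks whichever proposed action the composed critic scores highest. The critics and actors are hand-written toy functions on a one-dimensional state, standing in for learned TD3 networks.

```python
def compose_and(*critics):
    """Conjunction: pointwise minimum over critics."""
    return lambda s, a: min(q(s, a) for q in critics)

def compose_or(*critics):
    """Disjunction: pointwise maximum over critics."""
    return lambda s, a: max(q(s, a) for q in critics)

def select_action(state, actors, composed_critic):
    """Let every actor propose an action and keep the best-scored proposal."""
    proposals = [actor(state) for actor in actors]
    return max(proposals, key=lambda a: composed_critic(state, a))

# Hypothetical critics: one rewards landing at position 1, the other
# penalizes moving past 0.5 and carries a small constant cost.
def critic_goal(s, a):
    return -abs(s + a - 1.0)

def critic_safe(s, a):
    return -3.0 * max(0.0, s + a - 0.5) - 0.1

# Hypothetical actors: jump to the goal, or hold position.
def actor_goal(s):
    return 1.0 - s

def actor_safe(s):
    return 0.0

actors = [actor_goal, actor_safe]
action_and = select_action(0.0, actors, compose_and(critic_goal, critic_safe))
action_or = select_action(0.0, actors, compose_or(critic_goal, critic_safe))
```

In this toy case the conjunction holds position, because jumping to the goal scores badly under the safety critic, while the disjunction jumps to the goal. Negation is left out because its exact form depends on the chosen semantics.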

4. Neural modules, language grounding, and embodied policy composition

A modular lifelong formulation assumes a fixed set of shared subproblem solution spaces τ>0\tau>00 and a corresponding set of neural modules τ>0\tau>01, such that the optimal policy for task τ>0\tau>02 can be written as a composition

τ>0\tau>03

Modules are organized into ordered depths, one module is selected per depth, online exploration is performed with PPO on a module copy, and offline consolidation uses Batch-Constrained Q-learning over replay buffers from all tasks seen so far (Mendez et al., 2022). The reported settings include 64 tasks in a discrete 2-D grid domain and 48 tasks in robotic manipulation. The method exhibits zero-shot generalization when structure is given, forward transfer during sequential learning, and retention via offline replay. A plausible implication is that Composition-RL in lifelong settings is as much about routing and reuse as about closed-form value algebra.
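The "one module per depth" structure can be illustrated with ordinary functions in place of neural modules. In the sketch below the two depths (a perception stage and a control stage), the module names, and the observation format are all hypothetical; the point is only that a task is identified by a tuple of module choices and that an unseen tuple can be assembled without training.

```python
from functools import reduce

# Depth 0: perception modules that pick out the relevant target position.
PERCEPTION = {
    "red": lambda obs: (obs["agent"], obs["red"]),
    "blue": lambda obs: (obs["agent"], obs["blue"]),
}
# Depth 1: control modules that turn (agent, target) into a step of +1 or -1.
CONTROL = {
    "toward": lambda pair: 1 if pair[1] > pair[0] else -1,
    "away": lambda pair: -1 if pair[1] > pair[0] else 1,
}
DEPTHS = [PERCEPTION, CONTROL]

def compose_policy(selection):
    """Chain one chosen module per depth into a single policy function."""
    chosen = [DEPTHS[depth][name] for depth, name in enumerate(selection)]
    return lambda obs: reduce(lambda x, module: module(x), chosen, obs)

obs = {"agent": 2, "red": 5, "blue": 0}

# Suppose (red, toward) and (blue, toward) were seen during training...
red_toward = compose_policy(("red", "toward"))
blue_toward = compose_policy(("blue", "toward"))
# ...then (blue, away) is a new combination built from existing modules.
blue_away = compose_policy(("blue", "away"))
```

In the lifelong setting the hard parts are discovering which module to pick at each depth and updating modules without forgetting; this sketch takes the selection as given, matching the "structure is given" zero-shot case.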

Language-conditioned Composition-RL makes the compositional structure explicit in either latent state or symbolic parses. In BabyAI-style instruction following, a diagnostic classifier is attached to the agent’s LSTM hidden state τ>0\tau>04 to predict which subtask is currently active, with combined objective

τ>0\tau>05

The classifier’s gradient shapes the hidden states so that they encode “current objective,” yielding more interpretable clustering and improved performance on repeated-visit and zero-shot transfer settings (Meer et al., 2020). The effect is not uniform: the paper reports negligible effect on the simplest levels, but a reduction in episode length on “Before-repeat” and improved zero-shot success on novel attributes.

CERLLA pushes this symbolic direction much further. It pretrains τ>0\tau>06 world value functions, one per atomic attribute, and composes them through Boolean operators

τ>0\tau>07

with negation

τ>0\tau>08

A language instruction is mapped by a semantic parser to a Boolean expression over the symbols, and the resulting composed τ>0\tau>09-function is used greedily (Cohen et al., 21 Jan 2025). The parser itself is improved by RL-style feedback: BM25 retrieves up to πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),0 previous examples, an LLM proposes a beam of πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),1 candidate expressions, and a candidate parse is accepted when the composed policy reaches a success-rate equal to the oracle upper-bound performance of πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),2. On 162 BabyAI tasks, the method reaches a success rate equal to the oracle policy’s upper-bound performance of πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),3, while the non-compositional baseline reaches πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),4 with the same number of environment steps (Cohen et al., 21 Jan 2025).
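The parse-then-compose pipeline can be sketched as a small expression evaluator. The example below makes several assumptions: value functions are reduced to one number per candidate goal, conjunction and disjunction are pointwise min and max, and negation uses the form `q_max + q_min - q` from Boolean task algebra. The attribute names and the nested-tuple encoding of parses are hypothetical.

```python
def evaluate(expr, wvfs, q_max, q_min):
    """Evaluate a Boolean expression over attribute value functions.

    expr is either an attribute name or a tuple such as
    ("and", e1, e2), ("or", e1, e2), or ("not", e).
    """
    if isinstance(expr, str):
        return wvfs[expr]
    op, *args = expr
    vals = [evaluate(arg, wvfs, q_max, q_min) for arg in args]
    if op == "and":
        return {g: min(v[g] for v in vals) for g in q_max}
    if op == "or":
        return {g: max(v[g] for v in vals) for g in q_max}
    if op == "not":
        return {g: q_max[g] + q_min[g] - vals[0][g] for g in q_max}
    raise ValueError(f"unknown operator: {op}")

goals = ["red_ball", "red_box", "blue_ball", "blue_box"]
# Toy per-attribute values: 1 if reaching the goal satisfies the attribute.
wvfs = {
    "red": {g: int(g.startswith("red")) for g in goals},
    "ball": {g: int(g.endswith("ball")) for g in goals},
}
q_max = {g: 1 for g in goals}  # value function of the "everything counts" task
q_min = {g: 0 for g in goals}  # value function of the "nothing counts" task

# A parse such as "go to a red object that is not a ball".
parse = ("and", "red", ("not", "ball"))
composed = evaluate(parse, wvfs, q_max, q_min)
greedy_goal = max(goals, key=lambda g: composed[g])
```

Acting greedily on `composed` selects the red box. In the full method the composed object is a value function over states, goals, and actions, and the quality of a candidate parse is judged by rolling out the resulting policy.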

Concept learning provides another embodied route to compositionality. In a 3D Unity environment where instructions specify color-shape targets, A2C agents trained directly on color-and-shape combinations require about πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),5K episodes on train combinations and about πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),6K episodes on held-out combinations. When agents are first trained on color-only or shape-only instructions and then fine-tuned on color-and-shape combinations, train learning drops to πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),7K episodes and held-out learning to πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),8K episodes; only these concept-then-compose agents solve a more complex zero-shot πs(a)πˉs(a)exp(Q(s,a)/τ),\pi^*_s(a)\propto \bar\pi_s(a)\exp(Q^*(s,a)/\tau),9 environment (Lin et al., 2023). This suggests that Composition-RL can emerge either from explicit operators over learned primitives or from representational factorization that makes later recombination much easier.

5. Outcome-level RL and compositional generalization in generative models

A distinct contemporary meaning of Composition-RL concerns models that must interpret or generate unseen combinations of known primitives. One formulation defines a compositional type QQ0, where QQ1 is an ordered tuple of primitives and QQ2 is a composition rule. Instead of token-level cross-entropy, the policy QQ3 is optimized at the outcome level with Group Relative Policy Optimization (GRPO). Two rewards are studied: a binary exact-match reward and a composite reward built from primitive coverage and a compositional-skeleton term (Fu et al., 6 May 2026). On SCAN, COGS, GeoQuery, and CFQ, GRPO improves compositional generalization over supervised fine-tuning. The exact-match averages over three runs reported for SFT, GRPO-Bin, and GRPO-Comp are QQ4, QQ5, and QQ6, with especially large gains on SCAN-length and CFQ-MCD3 (Fu et al., 6 May 2026). The paper further reports that supervised models exhibit higher mean training-data trigram frequency among incorrect predictions, whereas GRPO reduces this copying bias and sharpens the output distribution.
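Outcome-level optimization with a binary exact-match reward can be sketched as follows. The snippet assumes the common GRPO normalization, in which each sampled output's advantage is its reward minus the group mean, divided by the group standard deviation (population standard deviation here, plus a small epsilon). The SCAN-style strings are illustrative, and the composite reward is not reproduced.

```python
import statistics

def exact_match_reward(prediction, target):
    """Binary outcome reward: 1 if the whole output matches, else 0."""
    return 1.0 if prediction.strip() == target.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four sampled outputs for the input "jump twice".
target = "JUMP JUMP"
samples = ["JUMP JUMP", "JUMP", "JUMP JUMP", "WALK WALK"]

rewards = [exact_match_reward(s, target) for s in samples]
advantages = group_relative_advantages(rewards)

# A group in which every sample is correct carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct samples receive positive advantages and incorrect ones negative advantages, regardless of how many tokens they share with the target, which is what distinguishes this objective from token-level cross-entropy. The all-correct group yields zero advantages throughout.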

Vision-language reasoning exposes a harder version of the same problem. ComPABench trains on isolated skills such as Shape Area and Grid Position, then evaluates cross-modal, cross-task, and out-of-distribution compositions (Li et al., 26 May 2025). For the 7B model, pure-text SFT attains near-perfect performance on the component tasks but drops to QQ7 on pure-text compositional evaluation; pure-text RL reaches QQ8. In multimodal composition, SFT attains QQ9 and RL S\mathcal S00. The paper argues that current VLMs trained with RL or other post-training strategies still struggle with composition in cross-modal and cross-task scenarios, and proposes RL-Ground, which combines “caption-before-thinking” with progressive vision-to-text grounding rewards. On multimodal composition, the 7B RL-Ground variant reaches S\mathcal S01, compared with S\mathcal S02 for baseline RL (Li et al., 26 May 2025).

A synthetic string-transformation study addresses a central controversy directly: whether RL teaches genuinely new skills or merely activates existing ones. Atomic skills are deterministic functions S\mathcal S03, S\mathcal S04, and the compositional target is S\mathcal S05. Stage 1 trains atomic skills; Stage 2 performs RL on composed tasks with a terminal binary reward (Yuan et al., 29 Sep 2025). RL on atomic data only succeeds on Level 1 but fails to generalize compositionally, while RL on Level 2 compositions generalizes to deeper unseen levels and transfers to a different target task, Countdown, when the model already has the target’s atomic skills. The same paper reports that next-token training on the same compositional data does not produce these effects. This provides explicit evidence for a compositional-skill acquisition account of RL in at least one controlled setting (Yuan et al., 29 Sep 2025).
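The task construction can be sketched with a few deterministic string functions. The particular skills below (`reverse`, `rotate`, `dup_last`) are hypothetical stand-ins for the paper's atomic functions; the sketch only shows how nesting depth defines the level and how a terminal binary reward is computed from the final string.

```python
# Hypothetical atomic skills: each is a deterministic string function.
SKILLS = {
    "reverse": lambda s: s[::-1],
    "rotate": lambda s: s[1:] + s[:1],     # move the first character to the end
    "dup_last": lambda s: s + s[-1:],      # repeat the last character
}

def apply_composition(names, x):
    """Apply skills given outermost first, e.g. ["f", "g"] means f(g(x))."""
    for name in reversed(names):
        x = SKILLS[name](x)
    return x

def terminal_reward(model_output, names, x):
    """Binary reward on the final answer only; no credit for partial steps."""
    return 1.0 if model_output == apply_composition(names, x) else 0.0

level_1 = apply_composition(["rotate"], "abcd")                         # "bcda"
level_2 = apply_composition(["reverse", "rotate"], "abcd")              # "adcb"
level_3 = apply_composition(["dup_last", "reverse", "rotate"], "abcd")  # "adcbb"

reward_right = terminal_reward("adcb", ["reverse", "rotate"], "abcd")
reward_wrong = terminal_reward("bcda", ["reverse", "rotate"], "abcd")
```

Because the reward checks only the final string, a model that applies just the inner skill (the `"bcda"` case) gets no credit, so success at Level 2 and beyond requires actually chaining the skills.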

6. Verifiable prompt and environment composition for RL with verifiable rewards

In RL with verifiable rewards, Composition-RL is used to create new training tasks automatically. One proposal targets the pass-rate-1 problem in RLVR: policy-gradient updates vanish when a prompt’s sampled rollouts are all correct or all incorrect, and easy prompts become increasingly prevalent as training proceeds (Xu et al., 12 Feb 2026). Sequential Prompt Composition (SPC) defines a recursive operator that combines multiple original prompts into a new verifiable question of compositional depth S\mathcal S06. Training uses GRPO over the surrogate compositional dataset S\mathcal S07, optionally with a curriculum S\mathcal S08, then S\mathcal S09, then S\mathcal S10. On Qwen3-4B, the overall average rises from S\mathcal S11 for baseline RL on original prompts to S\mathcal S12 for depth-2 Composition-RL; the curriculum variant reaches S\mathcal S13. In cross-domain training, Physics-Math-Composition reaches S\mathcal S14, compared with S\mathcal S15 for mix training and S\mathcal S16 for sequential Math-then-Physics (Xu et al., 12 Feb 2026).
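The motivation can be made concrete with a short calculation. The sketch assumes rollouts are independent, that a composed prompt counts as correct only when every constituent answer is correct, and that constituent pass rates multiply; these are simplifying assumptions for illustration, not claims about the paper's operator.

```python
def prob_zero_signal(pass_rate, group_size):
    """Probability that all rollouts in a group agree (all right or all wrong),
    in which case group-normalized advantages are zero."""
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size

def composed_pass_rate(pass_rates):
    """Pass rate of a composed prompt under the independence assumption."""
    out = 1.0
    for p in pass_rates:
        out *= p
    return out

group_size = 8       # hypothetical number of rollouts per prompt
easy = 0.98          # hypothetical pass rate of a nearly solved prompt

single = prob_zero_signal(easy, group_size)
depth_2 = prob_zero_signal(composed_pass_rate([easy, easy]), group_size)
depth_3 = prob_zero_signal(composed_pass_rate([easy, easy, easy]), group_size)
```

Under these assumptions the fraction of groups with no gradient signal shrinks as depth grows: composing nearly solved prompts pushes the pass rate away from 1 and makes the same pool of questions informative again.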

A related but more general construction treats verifiable environments themselves as composable objects. In RACES, each environment is a four-tuple S\mathcal S17, and two environments are composable when the codomain of one matches the domain of the next (Xiang et al., 10 Jun 2026). Recursive composition yields chains S\mathcal S18, and four operators instantiate RL tasks from these chains: S\mathcal S19, S\mathcal S20, S\mathcal S21, and S\mathcal S22. Path discovery uses a frontier-based BFS with quality filters on runtime errors, timeouts S\mathcal S23 s, excessive steps S\mathcal S24, and degenerate outputs. Across six unseen benchmarks, RACES improves DeepSeek-R1-Distill-Qwen-14B from S\mathcal S25 to S\mathcal S26 and Qwen3-14B from S\mathcal S27 to S\mathcal S28. On Qwen3-4B-Instruct-2507, RL on 50 base environments composed by RACES reaches S\mathcal S29, exceeding S\mathcal S30 from RL on 300 individual environments (Xiang et al., 10 Jun 2026).
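The composability condition and the breadth-first path discovery can be sketched with typed toy environments. The fields used below (`name`, `domain`, `codomain`, `solve`) are a hypothetical stand-in for the four-tuple, and the restriction that an environment appears at most once per chain is an assumption of this sketch. The quality filters are omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Env:
    name: str
    domain: str        # type of input the environment accepts
    codomain: str      # type of output it produces
    solve: Callable    # deterministic ground-truth function

def discover_chains(envs, max_depth):
    """Breadth-first enumeration of composable chains up to max_depth."""
    frontier = deque([env] for env in envs)
    chains = []
    while frontier:
        chain = frontier.popleft()
        chains.append(chain)
        if len(chain) < max_depth:
            for env in envs:
                # Composable when the previous codomain matches this domain.
                if env.domain == chain[-1].codomain and env not in chain:
                    frontier.append(chain + [env])
    return chains

def run_chain(chain, x):
    """Ground-truth answer of a chain: feed each output into the next env."""
    for env in chain:
        x = env.solve(x)
    return x

ENVS = [
    Env("digits", "int", "list", lambda n: [int(c) for c in str(n)]),
    Env("sort_desc", "list", "list", lambda xs: sorted(xs, reverse=True)),
    Env("total", "list", "int", sum),
    Env("square", "int", "int", lambda n: n * n),
]

chains = discover_chains(ENVS, max_depth=3)
by_name = {">".join(env.name for env in chain): chain for chain in chains}
answer = run_chain(by_name["digits>total>square"], 123)  # digits 1,2,3 -> 6 -> 36
```

Because every link is deterministic, the chain's final output is itself a verifiable answer, so a composed chain can serve as a new training task with an exact reward.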

These RLVR formulations invert the direction of classical Composition-RL. Instead of composing policies or value functions to solve a new downstream task, they compose tasks themselves to generate a richer on-policy curriculum. The common mechanism remains structural reuse: existing solved or verifiable units are not discarded after training but reassembled into more difficult instances whose reward remains exact.

Across these strands, several limitations recur. Exact value-function composition typically requires tasks that differ only in terminal-state rewards and, in the 2018 recipe, deterministic dynamics; AND-type composition is only approximate there (Niekerk et al., 2018). Soft correction and bound-based methods often assume shared state and action spaces and remain primarily tabular in their formal analyses (Adamczyk et al., 2022, Adamczyk et al., 2023). Formal-control approaches depend on abstraction error bounds, verification machinery, or graph decompositions (Lavaei et al., 2022, Žikelić et al., 2023). RLVR composition depends on reliable verifiers and incurs longer prompts or deeper environment chains as compositional depth increases (Xu et al., 12 Feb 2026, Xiang et al., 10 Jun 2026). Even with these constraints, the literature consistently treats composition not as a peripheral convenience but as the central mechanism for zero-shot transfer, safety-preserving synthesis, modular reuse, and compositional generalization.
