Reward Machine Decomposition in RL

Updated 10 May 2026

Reward Machine Decomposition is a structured framework that represents non-Markovian reward functions as finite-state automata to modularize complex RL tasks.
It employs state- and transition-based decomposition along with projection techniques in multi-agent settings to enable efficient sub-task policy learning.
Hierarchical and curriculum-driven methods enhance sample efficiency and interpretability, especially under partial observability and noisy event signals.

Reward machine decomposition is a principled framework for representing, analyzing, and exploiting the temporal and structural properties of non-Markovian reward functions in reinforcement learning (RL). By modeling complex reward structures as automata—reward machines (RMs)—and systematically decomposing these machines, RL tasks can be partitioned into simpler, independently solvable subproblems. This decomposition underpins advances in sample efficiency, credit assignment, interpretability, and scalability, spanning both single-agent and multi-agent domains.

1. Formalism of Reward Machines and Decomposition Principles

A reward machine is a finite-state automaton that processes high-level events (e.g., propositional assignments, symbolic features, or sensor detections) and emits scalar rewards during transitions. A general RM is given as a tuple: $\mathcal R = \langle U, u_1, F, \mathcal{AP}, \delta_u, \delta_r \rangle$ where $U$ is the set of non-terminal RM states, $u_1$ the initial state, $F$ the set of terminal states, $\mathcal{AP}$ the atomic proposition vocabulary, $\delta_u$ the RM transition function, and $\delta_r$ the immediate reward function. The automaton is "driven" by atomic propositions detected on each environment transition, enabling temporally extended and non-Markovian reward specification (Icarte et al., 2020, Li et al., 2024).
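
To make the tuple concrete, the following is a minimal Python sketch of an RM, not a reference implementation. It assumes that events are single string labels rather than full propositional assignments and that any (state, event) pair missing from the tables self-loops with zero reward. The coffee/office task and all identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardMachine:
    """Minimal RM: initial state u1, terminal states F, and the two
    functions delta_u / delta_r stored as (state, event) lookup tables."""
    initial: str
    terminal: set
    delta_u: dict  # (u, event) -> next RM state
    delta_r: dict  # (u, event) -> scalar reward

    def step(self, u, event):
        """Advance from RM state u on one detected event."""
        if u in self.terminal:
            return u, 0.0  # terminal states absorb (assumption)
        key = (u, event)
        # Assumption: unlisted pairs self-loop with zero reward.
        return self.delta_u.get(key, u), self.delta_r.get(key, 0.0)

    def run(self, events):
        """Feed a whole event trace; return final state and summed reward."""
        u, total = self.initial, 0.0
        for e in events:
            u, r = self.step(u, e)
            total += r
        return u, total


# Hypothetical non-Markovian task: "get coffee, then reach the office".
rm = RewardMachine(
    initial="u1",
    terminal={"done"},
    delta_u={("u1", "coffee"): "u2", ("u2", "office"): "done"},
    delta_r={("u2", "office"): 1.0},
)

# Visiting the office before fetching coffee earns nothing.
final_state, total_reward = rm.run(["office", "coffee", "office"])
```

The reward for reaching the office depends on whether coffee was collected earlier, which is exactly the history dependence that the RM state encodes.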

Decomposition exploits this explicit structure by partitioning the original RM $\mathcal R$ into a set of sub-machines $\{\mathcal R_i\}$. In multi-agent settings, decomposition typically projects $\mathcal R$ onto agent-specific local event alphabets, yielding local RMs whose composition is equivalent to the original machine up to behavioral bisimulation, so that joint satisfaction of all sub-RMs matches the global task (Shah et al., 19 Feb 2025, Ardon et al., 2023, Neary et al., 2020).

2. Decomposition Methodologies and Correctness

2.1 State- and Transition-Based Decomposition

Per-state decomposition (QRM/CRM): Associates a separate Q-function to each RM state. Experience is used to update all state-conditioned Q-functions via counterfactual or off-policy updates, allowing modular learning (Icarte et al., 2020).
Per-transition/option decomposition (HRM, here meaning hierarchical RL over an RM): Each RM transition (edge) corresponds to a temporally extended "option," following the standard options framework. An option is initiated in a given RM state and terminates when the RM transitions, facilitating both high-level planning and sub-task policy learning (Furelos-Blanco et al., 2022, Icarte et al., 2020).
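
The per-state idea can be sketched as a tabular update in which one environment transition is replayed from the point of view of every RM state. This is a hedged illustration of the counterfactual-update principle, not the cited authors' code; the dictionary layout of `rm`, the state and action names, and the hyperparameters are all assumptions.

```python
from collections import defaultdict


def counterfactual_update(q, rm, s, a, s_next, event, actions,
                          alpha=0.1, gamma=0.9):
    """Apply one Q-learning update per RM state for a single transition.

    q is keyed by (env_state, rm_state, action). The same (s, a, s_next,
    event) sample is reused for every RM state u, which is what makes the
    update counterfactual: u need not be the state the agent was really in.
    """
    for u in rm["states"]:
        u_next = rm["delta_u"].get((u, event), u)   # self-loop if unlisted
        r = rm["delta_r"].get((u, event), 0.0)
        if u_next in rm["terminal"]:
            target = r                               # no bootstrap at the end
        else:
            target = r + gamma * max(q[(s_next, u_next, b)] for b in actions)
        q[(s, u, a)] += alpha * (target - q[(s, u, a)])


# Hypothetical two-step task and a single observed transition.
rm = {
    "states": ["u1", "u2"],
    "terminal": {"done"},
    "delta_u": {("u1", "coffee"): "u2", ("u2", "office"): "done"},
    "delta_r": {("u2", "office"): 1.0},
}
q = defaultdict(float)
counterfactual_update(q, rm, s="s0", a="right", s_next="s1",
                      event="office", actions=["left", "right"])
```

Even if the agent was actually in RM state u1 when it entered the office, the Q-values conditioned on u2 still learn from that experience.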

2.2 Multi-Agent Decomposition

RM decomposition in multi-agent systems operates via projection/quotient constructions:

Each agent $i$ is given a local alphabet $\Sigma_i$, with the global event set $\Sigma = \bigcup_i \Sigma_i$. The RM is projected onto $\Sigma_i$ by partitioning states into equivalence classes that cannot be distinguished without observing $\Sigma \setminus \Sigma_i$ (Neary et al., 2020, Shah et al., 19 Feb 2025).
Design-time verification via bisimulation checks guarantees that the parallel composition of sub-RMs recovers the global RM's semantics (i.e., $\mathcal R \cong \parallel_i \mathcal R_i$) (Neary et al., 2020).
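
The projection step can be sketched as follows, under simplifying assumptions: states joined by an event the agent cannot observe are merged with a small union-find, and only transitions on local events are kept. Rewards and the bisimulation check are omitted, the projected transition relation is stored as sets because merging may introduce nondeterminism, and the two-agent task and all names are hypothetical.

```python
def project_rm(states, delta_u, local_events):
    """Quotient an RM by the events one agent cannot observe.

    Returns (blocks, local_delta): blocks are frozensets of merged states,
    local_delta maps (block, local_event) -> set of successor blocks.
    """
    parent = {u: u for u in states}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    # Merge the endpoints of every transition on a non-local event.
    for (u, e), v in delta_u.items():
        if e not in local_events:
            parent[find(u)] = find(v)

    block = {u: frozenset(w for w in states if find(w) == find(u))
             for u in states}

    # Keep only transitions the agent can observe, lifted to blocks.
    local_delta = {}
    for (u, e), v in delta_u.items():
        if e in local_events:
            local_delta.setdefault((block[u], e), set()).add(block[v])
    return set(block.values()), local_delta


# Hypothetical team task: A presses, then B presses, then A reaches the goal.
states = {"u0", "u1", "u2", "u3"}
delta_u = {("u0", "a_press"): "u1",
           ("u1", "b_press"): "u2",
           ("u2", "goal"): "u3"}
blocks_a, delta_a = project_rm(states, delta_u, {"a_press", "goal"})
```

In this example agent A cannot see `b_press`, so u1 and u2 collapse into a single local state and A's sub-task reduces to "press, then reach the goal".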

Methodology	Decomposition Unit	Theoretical Guarantees
QRM/CRM	RM states	Converges to RM-optimal policy
HRM/options	RM transitions/options	Hierarchically optimal policies
Projection for multi-agent RMs	Event-alphabet quotient	Behavioral bisimulation

3. Algorithms and Inference under Partial Observability

Reward machine decomposition can also be applied under uncertainty, partial observability, and sensor noise. In such cases the current RM state is latent, so inference over it must be coupled with policy learning.

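Belief-state tracking: Maintains a belief distribution $b_t \in \Delta(U)$ over RM states, approximating $\Pr(u_t \mid h_t)$, the posterior over the current RM state $u_t$ given the interaction history $h_t$. Three main inference strategies are: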
- Naive thresholding (point estimate via noisy classification).
- Independent belief updating (propagate probabilities via propositional assignment models).
- Temporal-dependency modeling (TDM): employs history-dependent models to output a belief over RM states directly.
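
As an illustration of the second strategy, the sketch below propagates a belief over RM states using per-step event probabilities from a noisy detector. It assumes that the events at each step are mutually exclusive with probabilities summing to one, that detections are independent across steps, and that unlisted (state, event) pairs self-loop. The detector outputs and state names are made up for the example.

```python
def belief_update(belief, event_probs, delta_u):
    """One step of independent belief updating over RM states.

    belief:      dict rm_state -> probability (must list every state)
    event_probs: dict event -> probability reported by a noisy detector
    delta_u:     dict (rm_state, event) -> next rm_state
    """
    new_belief = {u: 0.0 for u in belief}
    for u, b in belief.items():
        for e, p in event_probs.items():
            u_next = delta_u.get((u, e), u)  # self-loop if unlisted
            new_belief[u_next] += b * p
    return new_belief


delta_u = {("u1", "coffee"): "u2", ("u2", "office"): "done"}
b0 = {"u1": 1.0, "u2": 0.0, "done": 0.0}

# Hypothetical detector outputs on two consecutive steps.
b1 = belief_update(b0, {"coffee": 0.7, "none": 0.3}, delta_u)
b2 = belief_update(b1, {"office": 0.5, "none": 0.5}, delta_u)
```

Naive thresholding would instead commit to the most likely event at each step and discard this uncertainty. Because the update treats steps as independent, it can drift when detection errors are correlated over time, which is the case the history-dependent TDM approach targets.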

TDM delivers consistent RM-state posteriors under temporal dependence and nearly matches oracle performance in deep RL tasks under partial observability. These inference signals augment RL policies with relevant task memory, offering substantial sample-efficiency gains over flat or memoryless baselines (Li et al., 2024).

4. Hierarchical and Curriculum-Based Decomposition

Complex tasks often exhibit nested or recursive subgoal structure. Hierarchical reward machines (HRMs) extend the flat RM model by allowing invocation of sub-machines, analogous to function calls:

Hierarchical RMs: Each RM can "call" other RMs; execution proceeds via a call stack, maintaining deterministic trace semantics across machines (Furelos-Blanco et al., 2022). An HRM can be translated into an equivalent flat RM, but the flat machine's state space is typically exponentially larger, which highlights the representational economy of hierarchies.
Learning HRMs: A curriculum-based algorithm interleaves RM induction (often by Answer Set Programming, e.g., ILASP) with hierarchical RL, using curriculum selection to focus exploration and counter-examples to trigger structure learning.
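
The call-stack execution described above can be sketched with a deliberately simplified model. Here a state either carries ordinary event transitions or a single unconditional call to another machine, and control returns to the caller when the callee reaches an accepting state. The cited formalism is richer than this (for example, calls can be guarded by conditions), so the data layout and the three-machine task are assumptions for illustration only.

```python
def run_hrm(machines, root, events):
    """Run an event trace through a hierarchy of machines.

    Each machine is a dict with "initial", "accepting", "edges"
    ((state, event) -> state) and "calls" (state -> (callee, resume_state)).
    Returns True if the root machine ends in an accepting state.
    """
    stack = [[root, machines[root]["initial"]]]
    for event in events:
        # Descend: follow call edges until the top frame has none.
        while True:
            name, state = stack[-1]
            call = machines[name]["calls"].get(state)
            if call is None:
                break
            callee, resume = call
            stack[-1][1] = resume  # where the caller resumes on return
            stack.append([callee, machines[callee]["initial"]])
        # Step the innermost machine (self-loop if the event is unlisted).
        name, state = stack[-1]
        stack[-1][1] = machines[name]["edges"].get((state, event), state)
        # Return: pop every finished callee.
        while len(stack) > 1 and stack[-1][1] in machines[stack[-1][0]]["accepting"]:
            stack.pop()
    name, state = stack[-1]
    return len(stack) == 1 and state in machines[name]["accepting"]


machines = {
    "get_coffee": {"initial": "c0", "accepting": {"c1"},
                   "edges": {("c0", "coffee"): "c1"}, "calls": {}},
    "get_mail": {"initial": "m0", "accepting": {"m1"},
                 "edges": {("m0", "mail"): "m1"}, "calls": {}},
    "root": {"initial": "r0", "accepting": {"r3"},
             "edges": {("r2", "office"): "r3"},
             "calls": {"r0": ("get_coffee", "r1"), "r1": ("get_mail", "r2")}},
}
in_order = run_hrm(machines, "root", ["coffee", "mail", "office"])
out_of_order = run_hrm(machines, "root", ["mail", "coffee", "office"])
```

The root machine reuses the two sub-machines by name instead of inlining their states, which is the source of the representational saving relative to a flat RM.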

Curriculum-driven HRM learning accelerates convergence relative to flat decompositions, especially in long-horizon or sparse-reward environments (Furelos-Blanco et al., 2022).

5. Practical Impact: Sample Efficiency, Credit Assignment, and Interpretability

Empirical studies across domains—from gridworlds and robotic manipulation to multi-agent Overcooked—demonstrate the practical benefits of RM decomposition:

Sample efficiency: Modular subproblem learning reduces variance and concentrates learning on subgoals, yielding an order-of-magnitude faster policy acquisition compared to non-modular baselines. Decomposition enables rapid policy improvement with very few demonstrations in vision-based manipulation (Camacho et al., 2020).
Credit assignment: In cooperative multi-agent settings, decomposition mitigates the global credit assignment problem by localizing reward to productive agent behaviors. This enables robust policy learning even under codependent or synchronous agent dynamics (e.g., joint coordination in "buttons" and "rendezvous" tasks) (Shah et al., 19 Feb 2025).
Interpretability: Subtask RM structures map directly to human-understandable plans or protocols, providing transparency and debuggability beyond end-to-end learning (Ardon et al., 2023).

Domain	Baseline	With RM Decomposition
Multi-agent grid	Flat RL largely fails	Parallel RM learning converges
Vision robotics	DQN: 5% in <2000 episodes	DQRM: >95% in 500 episodes
POMDPs	RNN RL stalls	TDM nearly matches Oracle

6. Automated Learning of Decompositions

Reward machine decomposition can be achieved without prior domain knowledge. Algorithms exist for mining RMs from experience:

Optimization objective (LRM problem): Given traces, infer a minimal RM that decomposes the POMDP into conditionally-independent subproblems such that their joint policy achieves global optimality (Icarte et al., 2021).
Simultaneous policy and decomposition learning: Methods such as LOTaD employ task-conditioned architectures and bandit-based selection to search over candidate decompositions and jointly optimize policies (Shah et al., 19 Feb 2025).
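
Bandit-based selection over candidate decompositions can be illustrated with a standard UCB1 rule. This is a generic stand-in, not LOTaD's actual selection procedure, which is not specified here. The `evaluate` callback is hypothetical: in practice it would train briefly under a candidate decomposition and return a noisy score such as team return, whereas the demo uses fixed scores to stay deterministic.

```python
import math


def ucb_select(counts, means, c=1.0):
    """Pick a candidate index using UCB1; try each candidate once first."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    total = sum(counts)
    scores = [m + c * math.sqrt(2.0 * math.log(total) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def search_decompositions(evaluate, n_candidates, rounds, c=1.0):
    """Allocate `rounds` evaluations across candidate decompositions."""
    counts = [0] * n_candidates
    means = [0.0] * n_candidates
    for _ in range(rounds):
        k = ucb_select(counts, means, c)
        score = evaluate(k)
        counts[k] += 1
        means[k] += (score - means[k]) / counts[k]  # running average
    return counts, means


# Hypothetical scores for three candidate decompositions.
true_scores = [0.2, 0.8, 0.5]
counts, means = search_decompositions(lambda k: true_scores[k],
                                      n_candidates=3, rounds=200)
best = max(range(3), key=counts.__getitem__)
```

The exploration bonus keeps weaker candidates from being discarded after a single unlucky evaluation while still concentrating most of the training budget on the decomposition that scores best.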

These techniques enable decompositional RL in settings with unknown, partially observed, or dynamically changing task structures, subject to computational constraints in the search space of RMs.

7. Limitations and Open Questions

Current RM decomposition frameworks assume access to accurate event labels or labeling functions; learning or refining these abstractions under domain noise remains challenging (Li et al., 2024, Ardon et al., 2023). Communication of RM state or event signals is required for decentralized agents, which may be infeasible under limited connectivity. Generation or online refinement of decomposition candidates, formal sample-complexity bounds for decomposition learning loops, and extension to hierarchical or instantiated agent/task structures are open research directions (Shah et al., 19 Feb 2025, Ardon et al., 2023, Furelos-Blanco et al., 2022).

A plausible implication is that as environments and agent teams scale in complexity, hierarchical and data-driven RM decomposition will be essential for tractable deep RL in the presence of long-horizon, delayed, or structured rewards.