
Reward Machine Decomposition in RL

Updated 10 May 2026
  • Reward Machine Decomposition is a structured framework that represents non-Markovian reward functions as finite-state automata to modularize complex RL tasks.
  • It employs state- and transition-based decomposition along with projection techniques in multi-agent settings to enable efficient sub-task policy learning.
  • Hierarchical and curriculum-driven methods enhance sample efficiency and interpretability, especially under partial observability and noisy event signals.

Reward machine decomposition is a principled framework for representing, analyzing, and exploiting the temporal and structural properties of non-Markovian reward functions in reinforcement learning (RL). By modeling complex reward structures as automata—reward machines (RMs)—and systematically decomposing these machines, RL tasks can be partitioned into simpler, independently solvable subproblems. This decomposition underpins advances in sample efficiency, credit assignment, interpretability, and scalability, spanning both single-agent and multi-agent domains.

1. Formalism of Reward Machines and Decomposition Principles

A reward machine is a finite-state automaton that processes high-level events (e.g., propositional assignments, symbolic features, or sensor detections) and emits scalar rewards during transitions. A general RM is given as a tuple $\mathcal{R} = \langle U, u_1, F, \mathcal{AP}, \delta_u, \delta_r \rangle$, where $U$ is the set of non-terminal RM states, $u_1$ the initial state, $F$ the set of terminal states, $\mathcal{AP}$ the atomic proposition vocabulary, $\delta_u$ the RM transition function, and $\delta_r$ the immediate reward function. The automaton is "driven" by atomic propositions detected on each environment transition, enabling temporally extended and non-Markovian reward specification (Icarte et al., 2020, Li et al., 2024).
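The tuple above can be sketched as a small data structure. This is a minimal illustration, not code from the cited papers: it assumes each environment step yields a single event symbol (rather than a full truth assignment over $\mathcal{AP}$), that unlisted (state, event) pairs self-loop with zero reward, and the "coffee then office" task and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, Set, Tuple

State = Hashable
Event = str  # assumption: one symbol per step instead of a set of propositions


@dataclass
class RewardMachine:
    states: Set[State]                          # U, the non-terminal states
    initial: State                              # u_1
    terminal: Set[State]                        # F
    delta_u: Dict[Tuple[State, Event], State]   # state-transition function
    delta_r: Dict[Tuple[State, Event], float]   # reward function

    def step(self, u: State, event: Event) -> Tuple[State, float]:
        """Return (next_state, reward); unlisted pairs self-loop with reward 0."""
        if u in self.terminal:
            return u, 0.0
        key = (u, event)
        return self.delta_u.get(key, u), self.delta_r.get(key, 0.0)

    def run(self, events: Iterable[Event]) -> Tuple[State, float]:
        """Feed a whole event trace through the machine and sum the rewards."""
        u, total = self.initial, 0.0
        for e in events:
            u, r = self.step(u, e)
            total += r
        return u, total


# Hypothetical task: "get coffee, then reach the office".
rm = RewardMachine(
    states={"u1", "u2"},
    initial="u1",
    terminal={"done"},
    delta_u={("u1", "coffee"): "u2", ("u2", "office"): "done"},
    delta_r={("u2", "office"): 1.0},
)

final_state, total_reward = rm.run(["office", "coffee", "office"])
```

The first "office" event earns nothing because the machine is still in its initial state; the reward depends on the event history, which is what makes it non-Markovian in the environment state alone.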

Decomposition exploits this explicit structure by partitioning the original RM $\mathcal{R}$ into a set $\{\mathcal{R}_i\}$. In multi-agent settings, decomposition typically projects $\mathcal{R}$ onto agent-specific local event alphabets, yielding local RMs that are equivalent up to a behavioral bisimulation, ensuring that joint satisfaction of all sub-RMs matches the global task (Shah et al., 19 Feb 2025, Ardon et al., 2023, Neary et al., 2020).

2. Decomposition Methodologies and Correctness

2.1 State- and Transition-Based Decomposition

  • Per-state decomposition (QRM/CRM): Associates a separate Q-function to each RM state. Experience is used to update all state-conditioned Q-functions via counterfactual or off-policy updates, allowing modular learning (Icarte et al., 2020).
  • Per-transition/option decomposition (HRM): Each RM transition (edge) corresponds to a temporally extended "option," following the standard options framework. An option is initiated in a given RM state and terminates on transition, facilitating both high-level planning and sub-task policy learning (Furelos-Blanco et al., 2022, Icarte et al., 2020).
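The per-state idea can be sketched in tabular form: a single environment transition is replayed against every RM state, so each state-conditioned Q-function receives an update. This is a simplified sketch in the spirit of counterfactual updates, not the reference implementation; the RM, the state and action names, and the hyperparameters are hypothetical, and the event is passed in directly rather than computed by a labeling function.

```python
from collections import defaultdict

# Hypothetical RM: (rm_state, event) -> (next_rm_state, reward)
RM = {("u1", "coffee"): ("u2", 0.0), ("u2", "office"): ("done", 1.0)}
RM_STATES = ["u1", "u2"]   # non-terminal RM states
TERMINAL = {"done"}
ACTIONS = [0, 1]


def rm_step(u, event):
    """Unlisted (state, event) pairs self-loop with zero reward (assumption)."""
    return RM.get((u, event), (u, 0.0))


def crm_update(Q, s, a, s_next, event, alpha=0.5, gamma=0.9):
    """One Q-learning update per RM state for a single environment transition."""
    for u in RM_STATES:
        u_next, r = rm_step(u, event)
        if u_next in TERMINAL:
            target = r                      # no bootstrapping past termination
        else:
            target = r + gamma * max(Q[(s_next, u_next, b)] for b in ACTIONS)
        Q[(s, u, a)] += alpha * (target - Q[(s, u, a)])


Q = defaultdict(float)
# The agent moved from "hall" to "office_room" and the event "office" fired.
crm_update(Q, s="hall", a=1, s_next="office_room", event="office")
```

After this single step the Q-value conditioned on `u2` has moved toward the reward of 1, while the one conditioned on `u1` is unchanged, even though the agent was only ever in one RM state when it acted.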

2.2 Multi-Agent Decomposition

RM decomposition in multi-agent systems operates via projection/quotient constructions:

  • Each agent $i$ is given a local alphabet $\Sigma_i$, with the global event set $\Sigma = \bigcup_i \Sigma_i$. The RM is projected onto $\Sigma_i$ by partitioning states into equivalence classes that cannot be distinguished without observing events in $\Sigma \setminus \Sigma_i$ (Neary et al., 2020, Shah et al., 19 Feb 2025).
  • Design-time verification via bisimulation checks guarantees that the parallel composition of sub-RMs recovers the global RM's semantics (i.e., $\mathcal{R} \cong \parallel_i \mathcal{R}_i$) (Neary et al., 2020).

| Methodology | Decomposition Unit | Theoretical Guarantees |
|---|---|---|
| QRM/CRM | RM states | Converges to RM-optimal policy |
| HRM/options | RM transitions/options | Hierarchically optimal policies |
| Projection for multi-agent RMs | Event-alphabet quotient | Behavioral bisimulation |
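The projection step can be illustrated with a small quotient construction: states linked by an event outside the agent's local alphabet are merged, and only local-event transitions are kept between the resulting classes. This is a sketch under assumptions, not the construction from the cited papers verbatim: the function name, the three-step two-agent task, and the event names are hypothetical, terminal states are treated like any other state, and the bisimulation check on the parallel composition is not implemented here.

```python
def project_rm(states, transitions, local_events):
    """Project an RM onto one agent's local events.

    transitions: dict mapping (state, event) -> next_state.
    Returns (block, local) where block[u] is the equivalence class of u and
    local maps (class, local_event) -> set of successor classes.
    """
    parent = {u: u for u in states}

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Merge states connected by events this agent cannot observe.
    for (u, e), v in transitions.items():
        if e not in local_events:
            parent[find(u)] = find(v)

    members = {}
    for u in states:
        members.setdefault(find(u), set()).add(u)
    block = {u: frozenset(members[find(u)]) for u in states}

    # Keep only local-event transitions, lifted to equivalence classes.
    local = {}
    for (u, e), v in transitions.items():
        if e in local_events:
            local.setdefault((block[u], e), set()).add(block[v])
    return block, local


# Hypothetical team task: A presses a button, then B does, then A reaches a goal.
states = ["u0", "u1", "u2", "u3"]
transitions = {
    ("u0", "a_button"): "u1",
    ("u1", "b_button"): "u2",
    ("u2", "a_goal"): "u3",
}
block, local = project_rm(states, transitions, {"a_button", "a_goal"})
```

Agent A cannot tell `u1` from `u2`, so they collapse into one local state. A successor set with more than one element would signal that the projection is nondeterministic for that agent, which is one reason a separate verification step is needed before the local machines are used.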

3. Algorithms and Inference under Partial Observability

Reward machine decomposition can be extended to settings with uncertainty, partial observability, and sensor noise, where the current RM state cannot be read off directly. In such cases, inference over the latent RM state must be coupled with policy learning.

  • Belief-state tracking: Maintains a belief distribution $b_t$ over RM states, approximating the posterior $P(u_t \mid h_t)$ over the current RM state $u_t$ given the observation history $h_t$. Three main inference strategies are:
    • Naive thresholding (point estimate via noisy classification).
    • Independent belief updating (propagate probabilities via propositional assignment models).
    • Temporal-dependency modeling (TDM): employs history-dependent models to output a belief over RM states directly.

TDM delivers consistent RM-state posteriors under temporal dependence and nearly matches oracle performance in deep RL tasks under partial observability. These inference signals augment RL policies with relevant task memory, offering substantial sample-efficiency gains over flat or memoryless baselines (Li et al., 2024).
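The first two strategies can be contrasted in a few lines. The sketch below assumes a noisy detector that outputs a probability for each event symbol at every step and treats those per-step predictions as independent across time; the function names, the RM, and the probabilities are hypothetical, and TDM is not shown because it requires a learned history-dependent model.

```python
def belief_update(belief, event_probs, delta_u):
    """Independent belief updating: push belief mass through the RM transitions.

    belief: dict rm_state -> probability (must include terminal states).
    event_probs: dict event -> probability for the current step.
    delta_u: dict (rm_state, event) -> next_rm_state; unlisted pairs self-loop.
    """
    new_belief = {u: 0.0 for u in belief}
    for u, b in belief.items():
        for e, p in event_probs.items():
            new_belief[delta_u.get((u, e), u)] += b * p
    return new_belief


def threshold_update(u, event_probs, delta_u):
    """Naive thresholding: commit to the most likely event, keep a point estimate."""
    e = max(event_probs, key=event_probs.get)
    return delta_u.get((u, e), u)


delta_u = {("u1", "coffee"): "u2", ("u2", "office"): "done"}
belief = {"u1": 1.0, "u2": 0.0, "done": 0.0}

# Step 1: the detector is fairly sure coffee was picked up.
belief = belief_update(belief, {"coffee": 0.8, "none": 0.2}, delta_u)
# Step 2: the detector is unsure whether the office was reached.
belief = belief_update(belief, {"office": 0.5, "none": 0.5}, delta_u)

point_estimate = threshold_update("u1", {"coffee": 0.8, "none": 0.2}, delta_u)
```

The belief keeps residual mass on `u1` after the first step, whereas thresholding commits fully to `u2`; a single misclassification therefore corrupts the point estimate for the rest of the episode.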

4. Hierarchical and Curriculum-Based Decomposition

Complex tasks often exhibit nested or recursive subgoal structure. Hierarchical reward machines (HRMs) extend the flat RM model by allowing invocation of sub-machines, analogous to function calls:

  • Hierarchical RMs: Each RM can "call" other RMs; execution proceeds via a call stack, maintaining deterministic trace semantics across machines (Furelos-Blanco et al., 2022). An HRM can be translated into an equivalent flat RM whose state space is typically exponentially larger, which underlines the representational economy of hierarchies.
  • Learning HRMs: A curriculum-based algorithm interleaves RM induction (often by Answer Set Programming, e.g., ILASP) with hierarchical RL, using curriculum selection to focus exploration and counter-examples to trigger structure learning.

Curriculum-driven HRM learning accelerates convergence relative to flat decompositions, especially in long-horizon or sparse-reward environments (Furelos-Blanco et al., 2022).
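The call-stack execution described above can be sketched as a toy interpreter. This is a heavily simplified illustration and not the formal HRM semantics of the cited work: it assumes each state has either a single call edge or only event edges, omits any conditions attached to call edges, and uses hypothetical machine, state, and event names.

```python
def run_hrm(machines, root, events):
    """Return True if the event trace drives the root machine to its accept state.

    machines[name] = {"initial": s, "accept": s, "edges": {state: [edge, ...]}}
    An edge is ("event", symbol, next_state) or ("call", callee, return_state).
    """
    # Each frame: [machine_name, current_state, state_to_resume_in_caller]
    stack = [[root, machines[root]["initial"], None]]

    def enter_calls():
        # Descend through call edges until the top frame waits on an event.
        while True:
            name, state, _ = stack[-1]
            edges = machines[name]["edges"].get(state, [])
            call = next((e for e in edges if e[0] == "call"), None)
            if call is None:
                return
            _, callee, ret = call
            stack.append([callee, machines[callee]["initial"], ret])

    def unwind():
        # Pop finished sub-machines and resume their callers.
        while len(stack) > 1 and stack[-1][1] == machines[stack[-1][0]]["accept"]:
            _, _, ret = stack.pop()
            stack[-1][1] = ret

    for ev in events:
        enter_calls()
        name, state, _ = stack[-1]
        for kind, label, nxt in machines[name]["edges"].get(state, []):
            if kind == "event" and label == ev:
                stack[-1][1] = nxt
                break
        unwind()
    return len(stack) == 1 and stack[0][1] == machines[root]["accept"]


machines = {
    "root": {
        "initial": "r0", "accept": "rA",
        "edges": {"r0": [("call", "get_coffee", "r1")],
                  "r1": [("event", "office", "rA")]},
    },
    "get_coffee": {
        "initial": "c0", "accept": "cA",
        "edges": {"c0": [("event", "beans", "c1")],
                  "c1": [("event", "brew", "cA")]},
    },
}

solved = run_hrm(machines, "root", ["beans", "brew", "office"])
```

The `get_coffee` machine could be called from several places in a larger hierarchy without being duplicated, which is where the saving over an equivalent flat machine comes from.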

5. Practical Impact: Sample Efficiency, Credit Assignment, and Interpretability

Empirical studies across domains—from gridworlds and robotic manipulation to multi-agent Overcooked—demonstrate the practical benefits of RM decomposition:

  • Sample efficiency: Modular subproblem learning reduces variance and concentrates learning on subgoals, yielding an order-of-magnitude faster policy acquisition compared to non-modular baselines. Decomposition enables rapid policy improvement with very few demonstrations in vision-based manipulation (Camacho et al., 2020).
  • Credit assignment: In cooperative multi-agent settings, decomposition mitigates the global credit assignment problem by localizing reward to productive agent behaviors. This enables robust policy learning even under codependent or synchronous agent dynamics (e.g., joint coordination in "buttons" and "rendezvous" tasks) (Shah et al., 19 Feb 2025).
  • Interpretability: Subtask RM structures map directly to human-understandable plans or protocols, providing transparency and debuggability beyond end-to-end learning (Ardon et al., 2023).

| Domain | Baseline | With RM Decomposition |
|---|---|---|
| Multi-agent grid | Flat RL largely fails | Parallel RM learning converges |
| Vision robotics | DQN: 5% in <2000 episodes | DQRM: >95% in 500 episodes |
| POMDPs | RNN-based RL stalls | TDM nearly matches oracle |

6. Automated Learning of Decompositions

Reward machine decomposition can be achieved without prior domain knowledge. Algorithms exist for mining RMs from experience:

  • Optimization objective (LRM problem): Given traces, infer a minimal RM that decomposes the POMDP into conditionally-independent subproblems such that their joint policy achieves global optimality (Icarte et al., 2021).
  • Simultaneous policy and decomposition learning: Methods such as LOTaD employ task-conditioned architectures and bandit-based selection to search over candidate decompositions and jointly optimize policies (Shah et al., 19 Feb 2025).

These techniques enable decompositional RL in settings with unknown, partially observed, or dynamically changing task structures, subject to computational constraints in the search space of RMs.
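Bandit-based selection over candidate decompositions can be sketched with a standard UCB1 rule, where each arm is one candidate decomposition and the payoff is the return obtained by training under it for a short budget. UCB1 is used here only as a familiar stand-in; the source does not specify which bandit algorithm LOTaD uses, and the `fake_return` evaluator, the function names, and the payoff values are hypothetical placeholders for an actual training run.

```python
import math


def ucb1_select(counts, means, c=2.0):
    """Pick an arm: try each once, then maximise mean plus exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [m + math.sqrt(c * math.log(total) / n)
              for m, n in zip(means, counts)]
    return max(range(len(counts)), key=scores.__getitem__)


def search_decompositions(evaluate, n_candidates, rounds):
    """Allocate evaluation rounds across candidate decompositions."""
    counts = [0] * n_candidates
    means = [0.0] * n_candidates
    for _ in range(rounds):
        arm = ucb1_select(counts, means)
        reward = evaluate(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


def fake_return(k):
    """Stand-in for 'train briefly under decomposition k and measure return'."""
    return [0.2, 0.9, 0.5][k]


counts, means = search_decompositions(fake_return, n_candidates=3, rounds=200)
best = max(range(3), key=counts.__getitem__)
```

In a real loop the evaluator would be noisy and non-stationary because the policies keep improving, so a sliding-window or discounted bandit variant may be a better fit than plain UCB1.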

7. Limitations and Open Questions

Current RM decomposition frameworks assume access to accurate event labels or labeling functions; learning or refining these abstractions under domain noise remains challenging (Li et al., 2024, Ardon et al., 2023). Communication of RM state or event signals is required for decentralized agents, which may be infeasible under limited connectivity. Generation or online refinement of decomposition candidates, formal sample-complexity bounds for decomposition learning loops, and extension to hierarchical or instantiated agent/task structures are open research directions (Shah et al., 19 Feb 2025, Ardon et al., 2023, Furelos-Blanco et al., 2022).

A plausible implication is that as environments and agent teams scale in complexity, hierarchical and data-driven RM decomposition will be essential for tractable deep RL in the presence of long-horizon, delayed, or structured rewards.
