Hierarchical Reward Machines

Updated 18 May 2026
  • Hierarchical Reward Machines (HRMs) are reward machines enhanced with modular and hierarchical task decomposition for efficient non-Markovian reward specification.
  • They reduce state–action and automaton complexity exponentially by decomposing tasks, enabling faster convergence in long-horizon and sparse-reward domains.
  • HRMs facilitate scalable multi-agent coordination and multi-step reasoning across both tabular and deep reinforcement learning frameworks.

Hierarchical Reward Machines (HRMs) generalize the Reward Machine (RM) formalism by integrating hierarchical task decomposition and modularity into reward specification, learning, and credit assignment. HRMs have attained prominence for their ability to encode both non-Markovian reward structure and explicit subtask hierarchy, yielding exponential reductions in automaton and state–action space complexity for long-horizon, sparse-reward, or multi-agent reinforcement learning domains. Recent advancements have broadened their application from tabular and single-agent settings to cooperative multi-agent systems and multi-step evaluators for LLM reasoning, evidencing significant empirical speed-ups, robustness, and broad generalization (Zheng et al., 2024, Wang et al., 16 Mar 2025, Furelos-Blanco et al., 2022).

1. Formalism and Definition

The foundational element of the HRM framework is the Reward Machine: a finite-state automaton over high-level, temporally extended task events (propositions), with rewards emitted on state transitions as a function of the environment’s labeled action-state history. An RM is defined as the tuple

$$M = (U, u_0, F, \delta, r, \Sigma, \ell)$$

where $U$ is the set of RM states, $u_0$ the initial state, $F$ the set of terminal (accepting) states, $\delta: U \times \Sigma \rightarrow U$ a deterministic transition function over the current set of atomic propositions ($\Sigma \subseteq 2^P$), $r: U \times \Sigma \to \mathbb{R}$ the reward function, and $\ell: S \times A \times S \to \Sigma$ the labeling function that maps each environment transition to its high-level event (Zheng et al., 2024).
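The tuple above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the implementation from the cited papers: the class name `RewardMachine`, the key/door task, and the convention that unlisted (state, label) pairs self-loop with zero reward are all hypothetical. The labeling function $\ell$ is left out, so labels are passed in directly.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# One element of Sigma: the set of propositions that hold on a transition.
Label = FrozenSet[str]


@dataclass
class RewardMachine:
    states: Set[str]                         # U
    initial: str                             # u_0
    terminal: Set[str]                       # F
    delta: Dict[Tuple[str, Label], str]      # delta: U x Sigma -> U
    reward: Dict[Tuple[str, Label], float]   # r: U x Sigma -> R

    def step(self, u: str, label: Label) -> Tuple[str, float]:
        """Advance from RM state u on one label.

        Assumption: pairs missing from delta self-loop with reward 0.
        """
        key = (u, label)
        return self.delta.get(key, u), self.reward.get(key, 0.0)

    def run(self, labels: Iterable[Label]) -> Tuple[str, float]:
        """Return (final RM state, total reward) for a label sequence."""
        u, total = self.initial, 0.0
        for label in labels:
            if u in self.terminal:
                break
            u, r = self.step(u, label)
            total += r
        return u, total


# Hypothetical task: observe "key" first, then "door".
KEY, DOOR, NONE = frozenset({"key"}), frozenset({"door"}), frozenset()
rm = RewardMachine(
    states={"u0", "u1", "uF"},
    initial="u0",
    terminal={"uF"},
    delta={("u0", KEY): "u1", ("u1", DOOR): "uF"},
    reward={("u1", DOOR): 1.0},
)
```

Because the RM state records that the key was already seen, the reward for reaching the door depends on history even though each individual lookup is Markovian in the pair (environment state, RM state).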

Hierarchical Reward Machines extend this by organizing the set of propositions $P$ (or events) into $K$ levels, forming a directed acyclic graph. Each non-primitive proposition, that is, one above the lowest level, is defined in terms of its child nodes and represents a subtask that depends on the (possibly concurrent) achievement of those children. For every such proposition, a submachine (RM) is defined independently, facilitating decomposition and modularity.

An alternative formalization endows the RM with the ability to recursively invoke other RMs as callable subroutines, giving rise to call stacks and compositional semantics over multiple nested machines. The resulting HRM is thus a hierarchy of RMs with a designated root RM and a leaf RM used for immediate returns (Furelos-Blanco et al., 2022).
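The call-stack semantics can be illustrated with a simplified sketch. It assumes that each state either reacts to propositions or delegates unconditionally to one sub-machine, which is a simplification of the guarded calls in the formalism of Furelos-Blanco et al.; the names `Machine` and `HierarchyRunner` and the key/door machines are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Machine:
    initial: str
    terminal: str
    # state -> {proposition: next state}; fires when the proposition is observed
    edges: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # state -> (callee name, return state); the callee must finish first
    calls: Dict[str, Tuple[str, str]] = field(default_factory=dict)


class HierarchyRunner:
    def __init__(self, machines: Dict[str, Machine], root: str):
        self.machines, self.root = machines, root
        self.stack: List[Tuple[str, str]] = []  # (caller name, return state)
        self.name, self.state = root, machines[root].initial
        self._descend()

    def _descend(self) -> None:
        """Follow call edges downwards, pushing one frame per call."""
        while self.state in self.machines[self.name].calls:
            callee, return_state = self.machines[self.name].calls[self.state]
            self.stack.append((self.name, return_state))
            self.name, self.state = callee, self.machines[callee].initial

    def step(self, label: Set[str]) -> None:
        """Consume one label, then unwind finished calls and enter new ones."""
        for prop, nxt in self.machines[self.name].edges.get(self.state, {}).items():
            if prop in label:
                self.state = nxt
                break
        self._descend()
        while self.stack and self.state == self.machines[self.name].terminal:
            self.name, self.state = self.stack.pop()
            self._descend()

    def done(self) -> bool:
        root = self.machines[self.root]
        return self.name == self.root and self.state == root.terminal


# Hypothetical hierarchy: the root calls "get_key", then "open_door".
machines = {
    "get_key": Machine("k0", "kF", edges={"k0": {"key": "kF"}}),
    "open_door": Machine("d0", "dF", edges={"d0": {"door": "dF"}}),
    "root": Machine("r0", "rF", calls={"r0": ("get_key", "r1"),
                                       "r1": ("open_door", "rF")}),
}
runner = HierarchyRunner(machines, "root")
for label in [{"door"}, {"key"}, {"door"}]:
    runner.step(label)
print(runner.done())  # True
```

The first `{"door"}` label is ignored because the active machine is `get_key`; only after the key is observed does control return to the root and descend into `open_door`.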

HRMs are flat-equivalent to RMs: any HRM, whatever its height, is equivalent to a flat RM whose automaton states correspond to all reachable call-stack configurations, but the flat RM may incur an exponential blowup in the number of states and transitions (Furelos-Blanco et al., 2022).

2. Learning and Policy Execution Algorithms

HRM execution and learning leverage the options framework. Each submachine or subtask RM may be called as an option, with its own initiation conditions (DNF guards over propositions), policy, and termination conditions. Two update levels coexist: one at the top-level RM (root) and another at every callable sub-RM, with separate experience replay buffers and value/policy functions.

The general exploitation framework involves filling the options stack by recursively selecting call options until a “terminal” formula option is reached, executing actions according to the current subpolicy, and then updating Q-functions and traversing up/down the call stack as options terminate (Furelos-Blanco et al., 2022). Policy learning for each subtask can be performed via tabular Q-learning or DQN, at both primitive and composite levels (Zheng et al., 2024).
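The stack-filling step can be sketched as a greedy descent. This is an illustrative simplification rather than the authors' algorithm: the `OPTIONS` table, the `fill_option_stack` helper, and the flat dictionary of Q-values keyed by (environment state, machine, option) are hypothetical, and exploration (for example epsilon-greedy selection) is omitted.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical hierarchy: each machine maps to the options available in it.
# An option naming another machine is a "call" option; any other name is a
# "formula" option, which a primitive-action policy would then execute.
OPTIONS: Dict[str, List[str]] = {
    "root": ["get_key", "open_door"],
    "get_key": ["key"],
    "open_door": ["door"],
}

QTable = Dict[Tuple[Hashable, str, str], float]


def fill_option_stack(
    env_state: Hashable,
    root: str,
    options: Dict[str, List[str]],
    q_values: QTable,
) -> List[Tuple[str, str]]:
    """Greedily select call options until a formula option is reached.

    Returns the stack as (machine, chosen option) pairs, root first.
    Unseen entries default to a Q-value of 0; ties keep the first option.
    """
    stack: List[Tuple[str, str]] = []
    current = root
    while current in options:  # current is still a machine, so keep descending
        choice = max(
            options[current],
            key=lambda o: q_values.get((env_state, current, o), 0.0),
        )
        stack.append((current, choice))
        current = choice
    return stack


q: QTable = {("s0", "root", "open_door"): 0.5}
option_stack = fill_option_stack("s0", "root", OPTIONS, q)
```

The last entry of the returned stack names the formula option whose subpolicy acts in the environment; when it terminates, entries are popped and the enclosing options are re-evaluated.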

MAHRM, a multi-agent HRM algorithm, implements recursive top-down policy execution: the policy for each subtask at a given level chooses an option (a tuple of lower-level subtask calls and agent assignments), while policies for primitive subtasks select physical actions in the environment. Rewards and state updates propagate back up the hierarchy; for composite subtasks, multi-step Q-learning accumulates the rewards collected while the option executes and applies the update once the option completes (Zheng et al., 2024).
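The multi-step update for a composite subtask can be sketched with the standard SMDP-style Q-learning rule, in which the rewards gathered over the option's $k$ steps are discounted and the bootstrap term is scaled by $\gamma^k$. Whether MAHRM uses exactly this form is an assumption; the function name `smdp_q_update` and its parameters are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, str], float]


def smdp_q_update(
    q: QTable,
    state: Hashable,
    option: str,
    rewards: Sequence[float],
    next_state: Hashable,
    next_options: List[str],
    alpha: float = 0.1,
    gamma: float = 0.9,
    terminal: bool = False,
) -> float:
    """Update Q(state, option) after the option ran for len(rewards) steps.

    target = sum_i gamma^i * r_i + gamma^k * max_o' Q(next_state, o')
    """
    k = len(rewards)
    discounted_return = sum((gamma ** i) * r for i, r in enumerate(rewards))
    if terminal or not next_options:
        bootstrap = 0.0  # nothing to bootstrap from once the task has ended
    else:
        bootstrap = max(q[(next_state, o)] for o in next_options)
    target = discounted_return + (gamma ** k) * bootstrap
    q[(state, option)] += alpha * (target - q[(state, option)])
    return q[(state, option)]


q_table: QTable = defaultdict(float)
q_table[("s1", "b")] = 1.0
# The option "a" ran for three steps and was rewarded only on the last one.
new_value = smdp_q_update(
    q_table, "s0", "a", rewards=[0.0, 0.0, 1.0],
    next_state="s1", next_options=["a", "b"], alpha=0.5, gamma=0.5,
)
```

With these numbers the discounted return is 0.25, the bootstrap contributes 0.125, and the updated value is 0.5 × 0.375 = 0.1875.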

Inductive learning of HRMs from traces uses curriculum-guided episodes interleaved with ILASP-based symbolic automaton induction for root transitions, leveraging traces labeled by reward achievement and failure to infer or revise submachine transitions (Furelos-Blanco et al., 2022).

3. Theoretical Properties and Complexity

Under standard RL conditions (finite state and action spaces, diminishing learning rates), HRM-based Q-learning converges to optimal option-value functions for each subtask. The key complexity benefit derives from compositional factorization: the size of a flat RM automaton can grow exponentially as a task is composed from more subtasks, whereas the size of the HRM is linear or polynomial in the size of its sub-RMs. Both the automaton complexity and the overall joint state–action space shrink accordingly (Zheng et al., 2024, Furelos-Blanco et al., 2022). Theoretical results guarantee flat equivalence while establishing substantial runtime and sample-efficiency advantages.
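The gap between flat and hierarchical sizes can be made concrete with a toy family of hierarchies. This family is an assumption chosen for illustration and is not taken from the cited papers: a two-state base machine waits for one proposition, and each higher-level machine calls the machine directly below it a fixed number of times in sequence.

```python
def hrm_states(height: int, calls_per_machine: int) -> int:
    """Total states across all machines in the toy hierarchy.

    The base machine has 2 states; each of the `height` machines above it
    is a chain of `calls_per_machine` call edges, so calls_per_machine + 1 states.
    """
    return 2 + height * (calls_per_machine + 1)


def flat_states(height: int, calls_per_machine: int) -> int:
    """States of the equivalent flat chain obtained by inlining every call.

    Inlining multiplies the number of transitions by calls_per_machine at
    each level; a chain with t transitions has t + 1 states.
    """
    transitions = 1
    for _ in range(height):
        transitions *= calls_per_machine
    return transitions + 1


for h in (3, 10, 20):
    print(h, hrm_states(h, 2), flat_states(h, 2))
```

For two calls per machine, the hierarchy grows by three states per level, while the flat chain doubles its transitions per level (1025 states at height 10 against 32 in the hierarchy).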

4. Applications in Multi-Agent and Multi-Step Reasoning

In cooperative MARL, HRMs support hierarchical decomposition of complex tasks into agent-assignable subtasks. MAHRM handles concurrent, interdependent high-level events—assigning submachines to subsets of agents and orchestrating their coordination through hierarchical Q-learning. Empirical evaluation on navigation, collaborative crafting (Minecraft), and tightly coupled coordination (“Pass” domain, button-door tasks) demonstrates MAHRM’s superiority in convergence speed and stability over baseline approaches (Independent QRM, Decentralized QRM, Modular MAHRL), particularly in settings with concurrent events and interdependencies (Zheng et al., 2024).

HRM architectures have also been adapted for multi-step reward modeling in LLM reasoning (Wang et al., 16 Mar 2025). Here, HRMs provide both fine-grained (step-wise) and coarse-grained (multi-step) rewards over the nodes of a reasoning tree: individual reasoning steps are scored by a binary RM, and adjacent pairs of steps are merged and re-evaluated to capture self-correction and coherence. This mitigates reward-hijacking (reward hacking) by ensuring later corrections are recognized, and that overall reasoning segments maintain logical soundness. Application of Hierarchical Node Compression (HNC)—merging consecutive reasoning nodes in Monte Carlo Tree Search (MCTS) trajectories—augments data diversity and robustness.
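The fine-grained and coarse-grained scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions: `merge_adjacent` joins each pair of consecutive steps with overlapping windows, and `toy_scorer` is a hand-written stand-in for a learned reward model; neither reflects the exact procedure or model of Wang et al.

```python
from typing import Callable, List, Tuple


def merge_adjacent(steps: List[str]) -> List[str]:
    """Merge each pair of consecutive steps into one coarse-grained segment."""
    return [steps[i] + "\n" + steps[i + 1] for i in range(len(steps) - 1)]


def hierarchical_scores(
    steps: List[str], scorer: Callable[[str], float]
) -> Tuple[List[float], List[float]]:
    """Score single steps (fine) and merged step pairs (coarse)."""
    fine = [scorer(step) for step in steps]
    coarse = [scorer(segment) for segment in merge_adjacent(steps)]
    return fine, coarse


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for a learned reward model.

    Judges a segment by its final line only: 0.0 if that line still
    asserts the wrong sum, 1.0 otherwise.
    """
    last_line = text.splitlines()[-1]
    return 0.0 if "= 5" in last_line else 1.0


steps = ["2 + 2 = 5", "Correction: 2 + 2 = 4", "So the answer is 4"]
fine, coarse = hierarchical_scores(steps, toy_scorer)
print(fine)    # [0.0, 1.0, 1.0]
print(coarse)  # [1.0, 1.0]
```

The first step is penalized on its own, but the merged segment that contains both the error and its correction receives full credit, which is the behavior the multi-step reward is meant to capture.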

5. Empirical Evidence

Empirical evaluations report advantages for HRMs over both flat RMs and conventional reward models across several domains and metrics. In tabular and deep RL settings, non-flat HRMs converge faster than flat HRMs and faster than other automaton-based methods (CRM, DeepSynth, JIRP, LRM) on compositional tasks (Furelos-Blanco et al., 2022). MAHRM needs fewer steps to converge than its baselines and is reported to be the only method that remains stable across random seeds in multi-agent tasks (Zheng et al., 2024). In reward modeling for LLM reasoning, HRMs with HNC achieve higher and more stable evaluation metrics, including best-of-N accuracy on PRM800K at N=16 and generalization to Math500, together with consistent gains in cross-domain transfer (Wang et al., 16 Mar 2025).

| Domain/Task | HRM Variant | Convergence Steps / Accuracy | Baseline | Baseline Metric |
| --- | --- | --- | --- | --- |
| Navigation | MAHRM | […] | IQRM | […] |
| Minecraft | MAHRM | […] | IQRM, MOD | […], […] |
| Pass | MAHRM | […] | IQRM/MOD | fail |
| PRM800K (N=16) | HRM (LLM reward modeling) | […] | PRM | […] |
| Math500 | HRM (LLM reward modeling) | […] | PRM | […] |

MAHRM, LHRM, and HRM variants consistently outperform competitors on both synthetic and real-world tasks, especially in long-horizon or high-branching factor domains.

6. Limitations and Open Challenges

The current HRM paradigm depends crucially on the availability of a human-specified or hand-engineered hierarchy—both in terms of subtask decomposition and atomic event propositions. Automatic discovery or induction of hierarchy, especially from high-dimensional input (e.g., pixels), unknown propositions, or noisy labeling, remains largely unsolved (Zheng et al., 2024, Furelos-Blanco et al., 2022). While HRMs have been theoretically and empirically demonstrated to be more sample- and runtime-efficient than flat RMs, their actual efficiency is contingent on the granularity and correctness of the provided hierarchy.

Potential extensions include HRMs for continuous domains, online subtask discovery, noisy trace integration, and formal regret analysis. The transfer of HRM frameworks to new settings, including non-cooperative or adversarial MARL and more general structured reasoning tasks, is an active area of research.

7. Broader Impact and Significance

Hierarchical Reward Machines address non-Markovian reward specification, long-horizon credit assignment, and modular policy synthesis within a single framework. Their combination of automaton-based temporal structure, options-based HRL, and multi-agent or multi-step evaluation yields scalable solutions across RL, MARL, and sequential reasoning in LLMs (Zheng et al., 2024, Wang et al., 16 Mar 2025, Furelos-Blanco et al., 2022). The main limitation remains the requirement for an externally specified hierarchy, although ongoing research on induction and transferability may widen their reach. HRMs are applicable in robotics (multi-robot coordination), scheduling, and any domain with compositional, logically structured subtasks. Theoretical and empirical evidence points to exponential reductions in complexity and improved robustness, which positions HRMs as a central construct in modular and structured reward specification and learning.
