Hierarchical Reward Machines
- Hierarchical Reward Machines (HRMs) are reward machines enhanced with modular and hierarchical task decomposition for efficient non-Markovian reward specification.
- They reduce state–action and automaton complexity exponentially by decomposing tasks, enabling faster convergence in long-horizon and sparse-reward domains.
- HRMs facilitate scalable multi-agent coordination and multi-step reasoning across both tabular and deep reinforcement learning frameworks.
Hierarchical Reward Machines (HRMs) generalize the Reward Machine (RM) formalism by integrating hierarchical task decomposition and modularity into reward specification, learning, and credit assignment. HRMs encode both non-Markovian reward structure and an explicit subtask hierarchy, which can yield exponential reductions in automaton and state–action space complexity for long-horizon, sparse-reward, or multi-agent reinforcement learning domains. Recent work has extended them from tabular and single-agent settings to cooperative multi-agent systems and multi-step evaluators for LLM reasoning, reporting empirical speed-ups, improved robustness, and better generalization (Zheng et al., 2024; Wang et al., 2025; Furelos-Blanco et al., 2022).
1. Formalism and Definition
The foundational element of the HRM framework is the Reward Machine: a finite-state automaton over high-level, temporally extended task events (propositions), with rewards emitted on state transitions as a function of the environment’s labeled action-state history. An RM is defined as the tuple
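$$\mathcal{R} = \langle U, u_0, F, \delta, R, L \rangle,$$

where $U$ is the set of RM states, $u_0 \in U$ the initial state, $F \subseteq U$ the set of terminal (accepting) states, $\delta : U \times 2^{\mathcal{P}} \to U$ a deterministic transition function based on the current set of atomic propositions (drawn from $\mathcal{P}$), $R$ the reward function, and $L$ provides the high-level event labeling of each environment transition (Zheng et al., 2024).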
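A minimal sketch of such a machine may help make the definition concrete. It assumes the labeling function has already been applied, so the machine consumes one set of true propositions per environment step. The class name `RewardMachine`, its field names, and the key/door task are illustrative assumptions, not taken from the cited papers.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Props = FrozenSet[str]  # propositions that are true on one environment step


@dataclass
class RewardMachine:
    states: Set[str]
    initial: str
    terminal: Set[str]
    # delta[(u, props)] -> next RM state; a missing entry means "stay in u"
    delta: Dict[Tuple[str, Props], str]
    # reward[(u, u_next)] -> reward emitted on that RM transition (default 0)
    reward: Dict[Tuple[str, str], float]

    def step(self, u: str, props: Props) -> Tuple[str, float]:
        """Advance the RM by one labeled environment transition."""
        u_next = self.delta.get((u, props), u)
        return u_next, self.reward.get((u, u_next), 0.0)

    def run(self, labels: Iterable[Props]) -> Tuple[str, float]:
        """Replay a label sequence from the initial state, stopping at a terminal state."""
        u, total = self.initial, 0.0
        for props in labels:
            if u in self.terminal:
                break
            u, r = self.step(u, props)
            total += r
        return u, total


# Hypothetical task: observe "key" first, then "door".
rm = RewardMachine(
    states={"u0", "u1", "uA"},
    initial="u0",
    terminal={"uA"},
    delta={("u0", frozenset({"key"})): "u1", ("u1", frozenset({"door"})): "uA"},
    reward={("u1", "uA"): 1.0},
)
trace_ok = [frozenset(), frozenset({"key"}), frozenset(), frozenset({"door"})]
trace_bad = [frozenset({"door"}), frozenset({"key"})]
final_ok = rm.run(trace_ok)
final_bad = rm.run(trace_bad)
```

In this sketch the same "door" event is rewarded only when "key" was seen earlier, which is the non-Markovian dependence the RM state is meant to track.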
Hierarchical Reward Machines extend this by organizing the set of propositions or events into levels, forming a directed acyclic graph. Each non-primitive proposition (one above the lowest level) is defined in terms of its child nodes, so it represents a subtask whose completion depends on the (possibly concurrent) achievement of its children. For every non-primitive proposition, a submachine (RM) is defined independently, facilitating decomposition and modularity.
An alternative formalization endows the RM with the ability to recursively invoke other RMs as callable subroutines, giving rise to call stacks and compositional semantics over multiple nested machines. The resulting HRM is thus a hierarchy $H = \langle \mathcal{M}, M_r, \mathcal{P} \rangle$ of RMs $\mathcal{M}$ over propositions $\mathcal{P}$, with root RM $M_r$ and a leaf RM $M_\top$ that returns control immediately (Furelos-Blanco et al., 2022).
HRMs are flat-equivalent to RMs: any HRM of height $h$ is equivalent to a flat RM whose automaton states correspond to all reachable call-stack configurations, but the latter may incur exponential blowup in the number of states and transitions (Furelos-Blanco et al., 2022).
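The blowup can be illustrated on a deliberately simple family of hierarchies. The sketch below assumes every machine is a chain of edges, each labeled either with a proposition or with a call to a lower machine, and flattens by inlining each call. The names `flatten`, `build_chain_hierarchy`, and `M0`, `M1`, ... are hypothetical, and real HRMs allow branching that this toy model ignores.

```python
from typing import Dict, List, Tuple, Union

# An edge is a proposition name, or ("call", machine_name).
Step = Union[str, Tuple[str, str]]


def flatten(name: str, machines: Dict[str, List[Step]]) -> List[str]:
    """Inline every call recursively and return the flat sequence of propositions."""
    flat: List[str] = []
    for step in machines[name]:
        if isinstance(step, tuple):
            flat.extend(flatten(step[1], machines))  # expand the called machine in place
        else:
            flat.append(step)
    return flat


def build_chain_hierarchy(levels: int) -> Dict[str, List[Step]]:
    """M0 observes 'a' then 'b'; each higher machine calls the one below it twice."""
    machines: Dict[str, List[Step]] = {"M0": ["a", "b"]}
    for i in range(1, levels + 1):
        machines[f"M{i}"] = [("call", f"M{i - 1}"), ("call", f"M{i - 1}")]
    return machines


machines = build_chain_hierarchy(3)
hierarchical_edges = sum(len(steps) for steps in machines.values())
flat_edges = len(flatten("M3", machines))
```

With three levels above `M0`, the hierarchy stores 8 edges while its flat equivalent needs 16. Each added level adds two edges to the hierarchy but doubles the flat machine.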
2. Learning and Policy Execution Algorithms
HRM execution and learning leverage the options framework. Each submachine or subtask RM may be called as an option, with its own initiation conditions (DNF guards over propositions), policy, and termination conditions. Two update levels coexist: one at the top-level RM (root) and another at every callable sub-RM, with separate experience replay buffers and value/policy functions.
The general exploitation framework involves filling the options stack by recursively selecting call options until a “terminal” formula option is reached, executing actions according to the current subpolicy, and then updating Q-functions and traversing up/down the call stack as options terminate (Furelos-Blanco et al., 2022). Policy learning for each subtask can be performed via tabular Q-learning or DQN, supporting both primitive and composite levels (Zheng et al., 2024).
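The stack-filling step can be sketched as follows. This is a simplified illustration, not the cited algorithm: options are modeled as plain tuples, the selection rule is a greedy lookup in a hypothetical Q-table, and termination is checked only for the formula option on top of the stack.

```python
from typing import Callable, Dict, List, Set, Tuple

# An option is either ("call", machine_name) or ("formula", proposition).
Option = Tuple[str, str]
Stack = List[Tuple[str, Option]]


def fill_option_stack(root: str, choose: Callable[[str], Option]) -> Stack:
    """Descend from the root, pushing call options until a formula option is chosen."""
    stack: Stack = []
    machine = root
    while True:
        option = choose(machine)
        stack.append((machine, option))
        if option[0] == "formula":
            return stack
        machine = option[1]  # descend into the called machine


def pop_terminated(stack: Stack, observed: Set[str]) -> Stack:
    """Pop the top formula option once its proposition has been observed.

    Simplification: calling options below it stay on the stack; a fuller
    implementation would also check whether each called machine has finished.
    """
    if stack and stack[-1][1][0] == "formula" and stack[-1][1][1] in observed:
        return stack[:-1]
    return stack


# Hypothetical option values per machine (illustrative numbers only).
q_values: Dict[str, Dict[Option, float]] = {
    "root": {("call", "get_key"): 0.9, ("formula", "door"): 0.1},
    "get_key": {("formula", "key"): 0.7, ("call", "explore"): 0.2},
}


def greedy(machine: str) -> Option:
    return max(q_values[machine], key=q_values[machine].get)


stack = fill_option_stack("root", greedy)
after_key = pop_terminated(stack, {"key"})
```

Here the root selects the call to `get_key`, which in turn selects the formula option `key`. Once `key` is observed, that option is popped and control returns to the caller.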
MAHRM, a multi-agent HRM algorithm, implements recursive top-down policy execution: the policy for each non-primitive subtask chooses an option (a tuple of lower-level subtask calls and agent assignments), and the policy for a primitive subtask selects physical actions in the environment. Rewards and state updates are propagated back up the hierarchy; for composite subtasks, multi-step Q-learning collects and propagates rewards over the execution of the option until completion (Zheng et al., 2024).
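One common way to realize a multi-step update of this kind is the standard SMDP Q-learning rule for options, sketched below. Whether MAHRM uses exactly this rule is an assumption; the function name `smdp_q_update` and the state and option labels are illustrative.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def smdp_q_update(
    q: QTable,
    state: Hashable,
    option: Hashable,
    rewards: List[float],
    next_state: Hashable,
    next_options: Sequence[Hashable],
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> float:
    """Update Q(state, option) after the option ran for len(rewards) steps."""
    k = len(rewards)
    # Discounted sum of the rewards collected while the option was executing.
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Bootstrap from the best option available where the option terminated.
    bootstrap = max((q[(next_state, o)] for o in next_options), default=0.0)
    target = discounted + (gamma ** k) * bootstrap
    q[(state, option)] += alpha * (target - q[(state, option)])
    return q[(state, option)]


q: QTable = defaultdict(float)
new_value = smdp_q_update(
    q, "u0", "get_key", [0.0, 0.0, 1.0], "u1", ["open_door"], alpha=0.5, gamma=0.9
)
```

The option here runs for three steps and receives its only reward on the last one, so the target is the discounted return 0.9² · 1 plus a bootstrap term that is still zero.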
Inductive learning of HRMs from traces uses curriculum-guided episodes interleaved with ILASP-based symbolic automaton induction for root transitions, leveraging traces labeled by reward achievement and failure to infer or revise submachine transitions (Furelos-Blanco et al., 2022).
3. Theoretical Properties and Complexity
Under standard RL conditions (finite state/action spaces, diminishing learning rates), HRM-based Q-learning converges to optimal option-value functions for each subtask. The key complexity benefit derives from compositional factorization: a flat RM must represent every combination of subtask progress in a single automaton, so its size can grow exponentially, whereas the HRM stores each sub-RM once and its total size grows only linearly or polynomially in the sizes of the sub-RMs. Both the automaton complexity and the overall joint state–action space therefore shrink from exponential to linear or polynomial (Zheng et al., 2024; Furelos-Blanco et al., 2022). Theoretical results guarantee flat equivalence while establishing substantial runtime and sample efficiency advantages.
4. Applications in Multi-Agent and Multi-Step Reasoning
In cooperative MARL, HRMs support hierarchical decomposition of complex tasks into agent-assignable subtasks. MAHRM handles concurrent, interdependent high-level events—assigning submachines to subsets of agents and orchestrating their coordination through hierarchical Q-learning. Empirical evaluation on navigation, collaborative crafting (Minecraft), and tightly coupled coordination (“Pass” domain, button-door tasks) demonstrates MAHRM’s superiority in convergence speed and stability over baseline approaches (Independent QRM, Decentralized QRM, Modular MAHRL), particularly in settings with concurrent events and interdependencies (Zheng et al., 2024).
HRM architectures have also been adapted for multi-step reward modeling in LLM reasoning (Wang et al., 2025). Here, HRMs provide both fine-grained (step-wise) and coarse-grained (multi-step) rewards over the nodes of a reasoning tree: individual reasoning steps are scored by a binary RM, and adjacent pairs of steps are merged and re-evaluated to capture self-correction and coherence. This mitigates reward hacking by ensuring that later corrections are recognized and that overall reasoning segments maintain logical soundness. Hierarchical Node Compression (HNC), which merges consecutive reasoning nodes in Monte Carlo Tree Search (MCTS) trajectories, augments data diversity and robustness.
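The two scoring granularities can be sketched as follows. The scorer `toy_score` is a hypothetical stand-in for a learned reward model (it only checks for an `ERROR` marker on the last line of a segment), and joining adjacent steps with a newline is an assumption about how merged segments are formed.

```python
from typing import Callable, List, Tuple


def fine_and_coarse_scores(
    steps: List[str], score: Callable[[str], float]
) -> Tuple[List[float], List[float]]:
    """Score each step alone, then each merged pair of adjacent steps."""
    fine = [score(step) for step in steps]
    coarse = [score(a + "\n" + b) for a, b in zip(steps, steps[1:])]
    return fine, coarse


def toy_score(segment: str) -> float:
    """Hypothetical binary scorer: a segment is good if its last line has no ERROR marker."""
    return 0.0 if "ERROR" in segment.splitlines()[-1] else 1.0


steps = [
    "x = 2 + 2 = 5 ERROR",  # a flawed step
    "Correction: x = 4",    # the next step repairs it
    "So 2x = 8",
]
fine, coarse = fine_and_coarse_scores(steps, toy_score)
```

In this toy run the first step is penalized on its own, but the merged first pair is scored as sound because it ends in the correction. That is the kind of self-correction the coarse-grained signal is meant to credit.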
5. Empirical Evidence
Empirical evaluation shows advantages of HRMs over both flat RMs and conventional reward models across multiple domains and metrics. In tabular and deep RL settings, non-flat HRMs converge faster than flat HRMs and faster than other automaton-based methods (CRM, DeepSynth, JIRP, LRM) on compositional tasks (Furelos-Blanco et al., 2022). MAHRM needs fewer steps to converge than the baselines and is uniquely stable across random seeds in multi-agent tasks (Zheng et al., 2024). In reward modeling for LLM reasoning, HRMs with HNC achieve higher and more stable evaluation metrics than a PRM (e.g. best-of-$N$ accuracy on PRM800K at $N=16$, and improved generalization on Math500) and consistent gains in cross-domain transfer (Wang et al., 2025).
| Domain/Task | HRM Variant | Convergence Steps/Accuracy | Baseline | Baseline Metric |
|---|---|---|---|---|
| Navigation | MAHRM | 1 | IQRM | 2 |
| Minecraft | MAHRM | 3 | IQRM, MOD | 4, 5 |
| Pass | MAHRM | 6 | IQRM/MOD | fail |
| PRM800K (N=16) | HRM (LLM Reward Modeling) | 7 | PRM | 8 |
| Math500 | HRM (LLM Reward Modeling) | 9 | PRM | 0 |
MAHRM, LHRM, and HRM variants consistently outperform competitors on both synthetic and real-world tasks, especially in long-horizon or high-branching factor domains.
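For readers unfamiliar with the best-of-N metric used in the LLM rows, a minimal sketch follows: for each problem, the reward model picks the highest-scoring of N candidate solutions, and accuracy is the fraction of problems where that pick is correct. The toy scores, candidate strings, and the `answer:` convention are invented for illustration and do not reproduce any reported result.

```python
from typing import Callable, List, Tuple


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the reward model scores highest."""
    return max(candidates, key=score)


def best_of_n_accuracy(
    problems: List[Tuple[List[str], str]],
    score: Callable[[str], float],
    extract_answer: Callable[[str], str],
) -> float:
    """Fraction of problems whose top-scored candidate has the gold answer."""
    correct = sum(
        extract_answer(best_of_n(candidates, score)) == gold
        for candidates, gold in problems
    )
    return correct / len(problems)


# Hypothetical reward-model scores for N = 2 candidates per problem.
toy_scores = {
    "2+2=5, answer: 5": 0.2,
    "2+2=4, answer: 4": 0.9,
    "3*3=9, answer: 9": 0.6,
    "3*3=6, answer: 6": 0.7,  # the scorer wrongly prefers this one
}
problems = [
    (["2+2=5, answer: 5", "2+2=4, answer: 4"], "4"),
    (["3*3=9, answer: 9", "3*3=6, answer: 6"], "9"),
]
accuracy = best_of_n_accuracy(
    problems, toy_scores.__getitem__, lambda s: s.rsplit("answer: ", 1)[-1]
)
```

The toy scorer ranks a wrong candidate first on the second problem, so the accuracy is 0.5. The metric therefore measures the reward model's ranking quality rather than the generator alone.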
6. Limitations and Open Challenges
The current HRM paradigm depends crucially on the availability of a human-specified or hand-engineered hierarchy—both in terms of subtask decomposition and atomic event propositions. Automatic discovery or induction of hierarchy, especially from high-dimensional input (e.g., pixels), unknown propositions, or noisy labeling, remains largely unsolved (Zheng et al., 2024, Furelos-Blanco et al., 2022). While HRMs have been theoretically and empirically demonstrated to be more sample- and runtime-efficient than flat RMs, their actual efficiency is contingent on the granularity and correctness of the provided hierarchy.
Potential extensions include HRMs for continuous domains, online subtask discovery, noisy trace integration, and formal regret analysis. The transfer of HRM frameworks to new settings, including non-cooperative or adversarial MARL and more general structured reasoning tasks, is an active area of research.
7. Broader Impact and Significance
Hierarchical Reward Machines offer a structured approach to non-Markovian reward specification, long-horizon credit assignment, and modular policy synthesis. Their combination of automaton-based temporal logic, options-based hierarchical RL, and multi-agent or multi-step evaluation yields scalable solutions across RL, MARL, and sequential reasoning in LLMs (Zheng et al., 2024; Wang et al., 2025; Furelos-Blanco et al., 2022). Their main limitation remains the requirement for an externally specified hierarchy, although ongoing research on induction and transferability may widen their reach. HRMs are applicable in robotics (multi-robot coordination), scheduling, and any domain with compositional, logically structured subtasks. Theoretical and empirical evidence points to exponential reductions in complexity and improved robustness, positioning HRMs as a central construct in modular and structured reward specification and learning.