Hierarchical Reward Machines in RL
- Hierarchical Reward Machines are formal models that structure reward functions as modular, recursive finite-state automata to decompose complex tasks into manageable subtasks.
- They enable efficient policy learning and task decomposition by integrating options-based reinforcement learning with temporal abstraction and automata induction.
- HRMs boost convergence speed, scalability, and transfer learning in both single-agent and multi-agent settings by reusing learned subpolicies across dynamic environments.
Hierarchical Reward Machines (HRMs) formalize and operationalize the hierarchical structure of reward functions in reinforcement learning (RL) by representing tasks through interconnected finite-state automata, enabling the decomposition of complex objectives into modular, reusable, and temporally abstracted subtasks. HRMs generalize standard reward machines, which expose the internal structure of (possibly non-Markovian) reward functions, by allowing machines to invoke one another recursively, reflecting deep task hierarchies intrinsic to long-horizon and multi-stage RL domains. The framework spans single-agent, multi-agent, and lifelong learning scenarios, supports both hand-crafted and learned automata, and interfaces naturally with options-based hierarchical RL approaches.
1. Formal Definition and Structure
An HRM consists of a set of reward machines $\{M_0, \ldots, M_k\}$ with a designated root machine. Each $M_i = \langle U_i, u_i^0, F_i, \delta_i, r_i \rangle$ is an automaton, where $U_i$ is a finite set of states, $u_i^0 \in U_i$ is the initial state, $F_i \subseteq U_i$ is the accepting set, $\delta_i$ is a transition function over a high-level alphabet $2^{\mathcal{P}}$ (labels, i.e., truth assignments over a set of event propositions $\mathcal{P}$), and $r_i$ specifies the reward output on transitions.
Hierarchical composition is achieved by allowing transitions in a parent reward machine $M_i$ to invoke a child reward machine $M_j$, effectively making $M_j$ a callable subroutine representing a subtask or skill. When a call option is activated, execution control is transferred to the submachine until its accepting or terminal condition is reached, upon which control and (optionally) state information are returned to the parent. This extends the representational power from flat automata to a recursive call graph or stack of automata (Furelos-Blanco et al., 2022).
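As a concrete illustration, below is a minimal sketch of how an HRM could be represented in code. The class and field names (`RewardMachine`, `Edge`, `HRM`) and the encoding of guards as exact label sets are illustrative assumptions for this article, not an API from the cited works.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Tuple

# A transition either emits a reward on a propositional label ("formula" edge)
# or calls another reward machine as a subroutine ("call" edge).
@dataclass(frozen=True)
class Edge:
    target: str                  # successor state in this machine
    reward: float = 0.0          # reward emitted when the edge fires
    call: Optional[str] = None   # name of a child machine to invoke, if any

@dataclass
class RewardMachine:
    name: str
    states: FrozenSet[str]
    initial: str
    accepting: FrozenSet[str]
    # delta: (state, label) -> Edge; labels are frozensets of true propositions.
    # (Keying on exact label sets is a simplification of formula-guarded edges.)
    delta: Dict[Tuple[str, FrozenSet[str]], Edge] = field(default_factory=dict)

@dataclass
class HRM:
    machines: Dict[str, RewardMachine]
    root: str                    # name of the top-level machine

# Example: a root machine that calls "get_wood", then rewards reaching "home".
get_wood = RewardMachine(
    name="get_wood",
    states=frozenset({"u0", "uA"}),
    initial="u0",
    accepting=frozenset({"uA"}),
    delta={("u0", frozenset({"wood"})): Edge(target="uA")},
)
root = RewardMachine(
    name="root",
    states=frozenset({"v0", "v1", "vA"}),
    initial="v0",
    accepting=frozenset({"vA"}),
    delta={
        ("v0", frozenset({"forest"})): Edge(target="v1", call="get_wood"),   # call option
        ("v1", frozenset({"home"})): Edge(target="vA", reward=1.0),          # formula option
    },
)
hrm = HRM(machines={"get_wood": get_wood, "root": root}, root="root")
```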
This construction enables HRMs to express complex temporal requirements, including sequential, conditional, or concurrent task dependencies, through explicit modularization. In multi-agent settings, each submachine may correspond to a team, subgroup, or individual agent, with appropriate partitioning of state and event spaces (Zheng et al., 8 Mar 2024, Neary et al., 2020).
2. Methodological Foundations
HRMs support a hierarchical RL paradigm in which each call to a submachine is treated as an option in the sense of the options framework: an option is a temporally extended action defined by a policy, initiation set, and termination condition. Specifically, transitions labeled as formula options correspond to primitive, state-based rewards, whereas call options initiate execution of a sub-HRM.
Learning proceeds as follows (a call-stack execution sketch appears after this list):
- The agent maintains policies (e.g., Q-networks or actor-critic modules) at each HRM level.
- Upon entering a call option (i.e., invoking a sub-RM), the call stack is updated, and the agent’s behavior is controlled by the policy associated with that subtask.
- When the sub-RM reaches a terminal state, control returns to the parent automaton’s state.
- Updates to value functions and policies can occur at multiple hierarchical levels according to the current stack context (Furelos-Blanco et al., 2022).
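The sketch below illustrates this control flow, assuming the `RewardMachine`/`HRM` representation from the earlier sketch, a hypothetical `env` with `reset`/`step`, a `label_fn` mapping environment states to sets of true propositions, and one placeholder policy per submachine. It only shows which subtask policy is in control at each step, not a full learning algorithm.

```python
def run_episode(hrm, env, policies, label_fn, max_steps=200):
    """Execute one episode under an HRM, maintaining an explicit call stack.

    policies: dict mapping machine name -> callable (env_state, rm_state) -> action
    label_fn: callable mapping env_state -> frozenset of true propositions
    Both are placeholders; any per-subtask policy representation would work.
    """
    stack = [(hrm.root, hrm.machines[hrm.root].initial)]  # (machine name, RM state)
    s = env.reset()
    total_reward = 0.0

    for _ in range(max_steps):
        machine_name, u = stack[-1]
        machine = hrm.machines[machine_name]

        # If the active submachine has reached an accepting state, pop it and
        # resume the parent at the state the call edge advanced it to.
        if u in machine.accepting:
            stack.pop()
            if not stack:
                break                                  # root accepted: task complete
            continue

        # The policy of the submachine on top of the stack selects the action.
        a = policies[machine_name](s, u)
        s, _, done, _ = env.step(a)

        edge = machine.delta.get((u, label_fn(s)))
        if edge is not None:
            total_reward += edge.reward
            stack[-1] = (machine_name, edge.target)    # advance the current machine
            if edge.call is not None:
                # Call option: push the child machine; control transfers to it.
                stack.append((edge.call, hrm.machines[edge.call].initial))
        if done:
            break
    return total_reward
```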
In multi-agent domains, task decomposition is realized by hierarchical assignment of subtasks to agent coalitions and by synchronizing events across local RMs via bisimulation or projection operators, so that local subtask completion entails global task accomplishment (Neary et al., 2020, Zheng et al., 8 Mar 2024). In lifelong learning, HRMs enable the accumulation and transfer of previously learned subpolicy Q-functions, as subtasks are discovered and recomposed via logical progression and temporal operators (e.g., the sequential "then" operator) in temporal logic task specifications (Zheng et al., 2021).
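The projection idea behind such multi-agent decomposition can be sketched as follows: labels are restricted to one agent's local events, transitions that lose all of their events act as epsilon moves, and a subset construction yields the local machine. This is a simplified sketch assuming the `RewardMachine` representation from the earlier example (rewards are dropped), not the exact operator of Neary et al. (2020).

```python
def epsilon_closure(states, delta, local_events):
    """All RM states reachable via transitions whose labels contain no local events."""
    closure, frontier = set(states), list(states)
    while frontier:
        u = frontier.pop()
        for (src, label), edge in delta.items():
            if src == u and not (label & local_events) and edge.target not in closure:
                closure.add(edge.target)
                frontier.append(edge.target)
    return frozenset(closure)

def project_rm(machine, local_events):
    """Project a team reward machine onto one agent's local event set (sketch)."""
    local_events = frozenset(local_events)
    start = epsilon_closure({machine.initial}, machine.delta, local_events)
    states, frontier, delta = {start}, [start], {}
    while frontier:
        S = frontier.pop()
        targets = {}  # restricted label -> set of successor RM states
        for (src, label), edge in machine.delta.items():
            local_label = label & local_events
            if src in S and local_label:
                targets.setdefault(local_label, set()).add(edge.target)
        for local_label, T in targets.items():
            T_closed = epsilon_closure(T, machine.delta, local_events)
            delta[(S, local_label)] = T_closed
            if T_closed not in states:
                states.add(T_closed)
                frontier.append(T_closed)
    accepting = frozenset(S for S in states if S & machine.accepting)
    return {"states": frozenset(states), "initial": start,
            "accepting": accepting, "delta": delta}
```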
The HRM learning problem can be addressed either by supervised synthesis from symbolic traces (e.g., using ILASP for logic program induction) or by joint inference from experience using automata learning procedures that construct minimal consistent HRMs based on counterexamples (Furelos-Blanco et al., 2022, Xu et al., 2019).
3. Task Decomposition, Temporal Abstraction, and Modularity
Central to the HRM framework is its ability to modularize the reward function, and thus the RL problem, along well-defined subtasks:
- Each submachine represents a temporally extended, semantically meaningful subgoal or chunk of the overall task (e.g., “gather wood,” “craft plank,” “build shelter”).
- The parent automaton flexibly sequences (and potentially conditions) submachine invocations, supporting both sequential and conditional task structures (e.g., if-then-else, loops).
- Subtask modularity facilitates the reuse of learned policies and value functions across tasks and domains, and accelerates transfer and lifelong learning.
- By exposing this modularity to the agent, the effective decision horizon and exploration burden are reduced, and off-policy updates can be performed using counterfactual reasoning, i.e., updating subtask value functions even when only a higher-level task was being pursued (Icarte et al., 2020); a minimal sketch of such updates follows this list.
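The sketch below illustrates counterfactual updates in the spirit of CRM (Icarte et al., 2020) for a single flat reward machine, assuming the `RewardMachine` representation from the earlier example and a tabular Q-function; names and the tabular setting are illustrative.

```python
from collections import defaultdict

def crm_update(Q, machine, actions, s, a, s_next, label_next, alpha=0.1, gamma=0.99):
    """Counterfactual Q-update: replay one environment transition (s, a, s_next)
    against every reward-machine state u, not just the one actually visited.

    Q: defaultdict mapping (env_state, rm_state, action) -> value.
    label_next: frozenset of propositions true in s_next.
    """
    for u in machine.states:
        edge = machine.delta.get((u, label_next))
        if edge is None:
            r, u_next = 0.0, u                 # no RM transition fires from u
        else:
            r, u_next = edge.reward, edge.target
        if u_next in machine.accepting:
            target = r                          # terminal RM state: no bootstrap
        else:
            target = r + gamma * max(Q[(s_next, u_next, b)] for b in actions)
        Q[(s, u, a)] += alpha * (target - Q[(s, u, a)])

# Usage sketch (hypothetical): Q = defaultdict(float); after each environment step,
# call crm_update(Q, rm, actions, s, a, s_next, label_fn(s_next)).
```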
In multi-agent HRMs, the reward decomposition accounts for explicit agent dependencies and may support temporal or concurrent subtask execution (Zheng et al., 8 Mar 2024).
4. Algorithmic, Mathematical, and Implementation Aspects
HRM-based RL extends the standard update rules to incorporate the automaton’s state as context:
- The product-state MDP is defined over pairs $(s, u)$, where $s$ is the environmental state and $u$ is the HRM node on the current call stack.
- Q-learning and policy update formulas are generalized to respect the hierarchical control flow and modular reward function, e.g.,
  $Q(s, u, a) \leftarrow Q(s, u, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', u', a') - Q(s, u, a) \big]$,
  where $u' = \delta(u, L(s'))$ is the successor HRM state under the labeling function $L$ (which maps environment states to high-level labels) and $r = r(u, L(s'))$ is the reward emitted by that transition.
- For higher-level call options, value updates are performed over temporally extended trajectories, accumulating discounted composite rewards and conditioning on stack terminations (Furelos-Blanco et al., 2022, Zheng et al., 8 Mar 2024).
- Sample efficiency is often improved by sharing experience data across compatible HRM contexts (e.g., using counterfactual reasoning or automaton state sharing (Icarte et al., 2020)).
- Automated reward shaping techniques can be derived by computing potential-based shaping terms on the HRM graph to accelerate learning without affecting policy optimality (Icarte et al., 2020); a sketch of this construction follows the list.
- Inference of HRM structure can be performed via automata induction methods, subject to constraints on hierarchical calls, determinism, and minimality (Furelos-Blanco et al., 2022).
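As one concrete instance of the shaping construction mentioned above, the sketch below computes a potential for each RM state by value iteration over the RM graph alone and applies the standard potential-based shaping term. It assumes the `RewardMachine` representation from the earlier example; the names and hyperparameters are illustrative.

```python
def rm_potentials(machine, gamma=0.99, iters=100):
    """Potential function over RM states via value iteration on the RM graph.

    The RM is treated as a small MDP whose 'actions' are its outgoing labeled
    transitions; the value of a state is the best discounted RM reward
    achievable from it, with accepting states fixed at potential 0.
    """
    v = {u: 0.0 for u in machine.states}
    for _ in range(iters):
        new_v = {}
        for u in machine.states:
            if u in machine.accepting:
                new_v[u] = 0.0
                continue
            candidates = [edge.reward + gamma * v[edge.target]
                          for (src, _), edge in machine.delta.items() if src == u]
            new_v[u] = max(candidates) if candidates else 0.0
        v = new_v
    return v

def shaped_reward(r, u, u_next, potentials, gamma=0.99):
    """Potential-based shaping r + gamma * phi(u') - phi(u), which preserves optimal policies."""
    return r + gamma * potentials[u_next] - potentials[u]
```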
5. Empirical Results and Comparative Impact
Experiments demonstrate that HRMs yield significant gains in convergence speed, final policy quality, and scalability over flat (non-hierarchical) reward machine approaches and over traditional monolithic or black-box reward representations:
- Handcrafted HRMs in “CraftWorld,” “WaterWorld,” and Minecraft-like domains allow agents to solve tasks that are unmanageable with equivalent flat RMs, because submachine reuse yields an exponential reduction in the size of the equivalent flat automaton's state space (Furelos-Blanco et al., 2022).
- In cooperative multi-agent settings, the MAHRM framework delivers robust learning even as agent number and subtask interdependence increase, outperforming alternatives that lack hierarchical or modular task decomposition (Zheng et al., 8 Mar 2024).
- HRMs facilitate transfer in lifelong reinforcement learning by supporting knowledge reuse at the subtask level: SLTL or LTL formulas specify new tasks, and previously learned Q-functions are combined via value composition induced by the logical operators (Zheng et al., 2021).
- In deep RL, modular architectures implementing HRM logic (e.g., additional channels for automaton state encoding) allow for integration with convolutional or MLP-based policy networks (Furelos-Blanco et al., 2022, Camacho et al., 2020).
- HRMs outperform both independent and centralized Q-learning (not equipped with hierarchical decomposition) on a variety of single- and multi-agent benchmarks (Neary et al., 2020, Zheng et al., 8 Mar 2024).
6. Limitations, Open Challenges, and Future Directions
Practical deployment of HRMs currently requires substantial prior knowledge for reward machine and submachine specification, as well as careful engineering of the high-level event vocabulary and labeling functions. Automating HRM induction from experience, generalizing to partially observable or noisy abstraction settings, and integrating HRMs with symbolic planning or temporal logic specification languages are open research directions (Furelos-Blanco et al., 2022, Li et al., 2022, Icarte et al., 2021). Especially in multi-agent systems, scaling HRM induction and credit assignment with overlapping or concurrent subtasks remains a technical challenge (Zheng et al., 8 Mar 2024).
Another important avenue is the integration of interpretable, symbolically structured HRMs with deep learning architectures, possibly via latent variable models or hybrid neural-symbolic methods (Zhou et al., 2022). HRMs also serve as a theoretical bridge connecting classical hierarchical RL, options, reward shaping, logic-based task decomposition, and automata-based reward specification, suggesting a unifying perspective for complex, structured reinforcement learning environments.
Table: HRM Features in Contemporary RL
| Aspect | HRM Approach | Source(s) |
|---|---|---|
| Modular hierarchy | Submachines and recursive calls | (Furelos-Blanco et al., 2022, Zheng et al., 8 Mar 2024) |
| Task decomposition | Temporal/semantic subgoals | (Icarte et al., 2020, Zheng et al., 2021) |
| Multi-agent support | Subtask allocation, concurrency | (Zheng et al., 8 Mar 2024, Neary et al., 2020) |
| Policy learning | Option-based, Q-learning per module | (Furelos-Blanco et al., 2022, Icarte et al., 2020) |
| Structure learning | Automata induction, ILASP, SLTL | (Furelos-Blanco et al., 2022, Zheng et al., 2021) |
| RL algorithm | Product MDP, modular value updates | (Icarte et al., 2020, Furelos-Blanco et al., 2022) |
| Transfer/lifelong | Value composition, subpolicy reuse | (Zheng et al., 2021, Camacho et al., 2020) |
This structure highlights the central role of HRMs in modularizing reward function design, enabling algorithmic task decomposition, supporting efficient policy synthesis, and providing the formal substrate for interpretable, scalable RL in complex domains.