Hierarchical Reinforcement Learning Formalism

Updated 10 December 2025
  • Hierarchical reinforcement learning formalism is a framework that decomposes complex sequential decision problems into modular subtasks using temporal abstraction.
  • It employs abstraction mappings and option synthesis to create transferable low-level policies and boost sample efficiency.
  • The formalism integrates hierarchical Bellman equations and policy gradient methods to ensure convergence and provide optimality guarantees in complex environments.

Hierarchical reinforcement learning (HRL) formalism provides a mathematical and algorithmic framework for decomposing large sequential decision problems into modular, temporally or spatially abstracted subtasks. Central to HRL is temporal abstraction, that is, policy decomposition via options, subtasks, or skills, which yields improved sample efficiency, transferable solutions, and structured integration of symbolic knowledge. Formal HRL frameworks define precise mappings from low-level Markov Decision Processes (MDPs) to high-level abstract decision processes (ADPs), often leveraging planning models, constrained option policies, or automata-theoretic constructs. This article reviews the core mathematical apparatus, abstraction relations, option induction, reward mechanisms, and optimality guarantees in contemporary HRL formalism.

1. HRL: Mathematical Structures and Formal Definitions

HRL builds upon the standard finite or continuous MDP defined by $M = (S, A, P, r, s_0, G, \gamma)$. Various HRL formalisms specify a high-level abstraction over $M$ using:

  • Subgoals or Regions: Partitioning $S$ into disjoint or overlapping sets $G = \{g_1, \dots, g_K\}$ to serve as abstract states for a higher-level process. Each region may act as an initiation or termination domain of an option (Jothimurugan et al., 2020).
  • Options/Temporally Extended Actions: An option is defined as a triple $\omega = (I, \pi, \beta)$, where $I$ is the initiation set, $\pi$ is the intra-option policy, and $\beta$ is the termination function. For instance, options can be systematically derived from planning operators in an AI planning model $\Pi = (V, O, s'_0, G')$, where preconditions and effects map to initiation and termination sets via an abstraction function $L: S \to S'$ (Lee et al., 2022).
  • Hierarchical Decompositions: Subtasks arranged in DAGs (e.g., MAXQ (Jonsson et al., 2016), batch HRL (Zhao et al., 2016)) or options graphs, with each subtask/subpolicy solving a restricted SMDP (Semi-Markov Decision Process) on $M$ or an abstracted version thereof.
  • Product or Augmented MDPs: When specifications are given symbolically (e.g., temporal logic or reward machines), the HRL formalism augments the state space to $S \times Q$ with automaton/control state $Q$ representing task progress (Li et al., 2017, Furelos-Blanco et al., 2022).

This formalism enables the definition of high-level policies operating over abstracted state-action spaces, where each abstract action invokes a learned or planned low-level policy segment.
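
To ground the option triple defined in the list above, the following minimal Python sketch encodes $\omega = (I, \pi, \beta)$ as callables over ground states, with a subgoal region acting as the termination set. The `Option` dataclass and the grid-world example are illustrative assumptions, not an implementation from the cited papers.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable

@dataclass
class Option:
    """An option omega = (I, pi, beta) over a ground MDP."""
    initiation: Callable[[State], bool]     # I: states where the option may be invoked
    policy: Callable[[State], Action]       # pi: intra-option (low-level) policy
    termination: Callable[[State], float]   # beta: probability of terminating in a state

# Illustrative grid-world skill: walk right until a subgoal region is reached.
# The region (a band of cells) plays the role of an abstract state / termination domain.
subgoal_region = {(x, y) for x in range(5, 7) for y in range(10)}

go_to_region = Option(
    initiation=lambda s: s not in subgoal_region,                 # start anywhere outside
    policy=lambda s: "right",                                     # hand-coded intra-option policy
    termination=lambda s: 1.0 if s in subgoal_region else 0.0,    # terminate on entry
)
```

A high-level policy then chooses among such options, treating each subgoal region as a single abstract state of the higher-level process.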

2. Abstraction Mappings and Option Synthesis

Formal mapping from ground MDPs to abstract decision processes is driven by the choice of abstraction function and the realizability of abstract actions:

  • Planning-Based Abstraction: In PaRL, an abstraction $L: S \to S'$ relates MDP states to symbolic planning states; operators $o \in O$ induce options $\omega_o$ with $I_o = \{s \in S \mid \text{pre}(o) \subseteq L(s)\}$ and $\beta_o(s) = 1$ iff $\text{eff}(o) \subseteq L(s)$ (Lee et al., 2022); a small code sketch of this induction follows the list.
  • Second-Order Abstractions and Realizability: Recent work formalizes $\varphi$-realizable abstractions $(M_H, \varphi)$, with $M_H = (S_H, A_H, P_H, R_H, \gamma_H)$ and $P_H$ a second-order kernel $P_H(s'_H \mid s_{H,p}, s_H, a_H)$, ensuring that for each high-level transition there exists a compositional option in the original MDP matching the induced occupancy and value constraints (Cipollone et al., 4 Dec 2025).
  • ADP Non-Markovity and Markovization: When options do not induce identical transitions from all points in a region, resulting ADPs can be non-Markovian. Sufficient conditions for Markovity involve careful design of entry, exit, and initiation sets (Jothimurugan et al., 2020, Cipollone et al., 4 Dec 2025).
  • Automata- and Reward-Machine-Guided HRL: Logical or automata specifications (as in scTLTL or reward machines) induce a product MDP; intrinsic rewards are generated for automaton progress, and hierarchical options arise as sub-policies for automata substates or callable submachines (Li et al., 2017, Furelos-Blanco et al., 2022).
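
As a concrete reading of the planning-based abstraction above, the sketch below derives the initiation predicate and termination function of an induced option from an operator's preconditions and effects under an abstraction $L$. The fact-set state representation, the `abstraction` callable, and the toy facts are assumptions made for illustration, not the PaRL implementation.

```python
from typing import Callable, FrozenSet, Hashable

State = Hashable
Fact = str  # symbolic planning facts, e.g. "door_open"

def induce_option_sets(
    pre: FrozenSet[Fact],
    eff: FrozenSet[Fact],
    abstraction: Callable[[State], FrozenSet[Fact]],
):
    """Build I_o and beta_o for a planning operator o = (pre, eff):
    I_o       = { s : pre(o) subset of L(s) }    (operator applicable)
    beta_o(s) = 1 iff eff(o) subset of L(s)      (operator achieved)
    """
    initiation = lambda s: pre <= abstraction(s)
    termination = lambda s: 1.0 if eff <= abstraction(s) else 0.0
    return initiation, termination

# Toy abstraction exposing the facts that hold in a ground state (dict of fact -> bool).
L = lambda s: frozenset(f for f, holds in s.items() if holds)
init, term = induce_option_sets(frozenset({"at_door"}), frozenset({"door_open"}), L)

s = {"at_door": True, "door_open": False}
assert init(s) and term(s) == 0.0  # applicable, not yet achieved
```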

Option policies can be synthesized by solving constrained MDPs over abstract blocks, enforcing both optimal block reward and exit distribution matching the abstract transition model, typically using PAC-safe policy optimization (Cipollone et al., 4 Dec 2025).

3. Hierarchical Bellman Equations and Policy Gradient Objectives

The optimization objectives and Bellman recursions in HRL extend standard MDP methods by incorporating intra-option policies, higher-level policies, and shaped rewards:

  • SMDP Bellman Equations: For option-based HRL, the hierarchy induces an SMDP $M' = (S, \Omega, \bar{P}, \bar{R}, s_0, G, \gamma)$ with options $\Omega$. The value functions satisfy:

$$V^\mu(s) = \sum_o \mu(o \mid s)\, Q^\mu(s, o), \qquad Q^\mu(s, o) = E\left[\sum_{t=0}^{k-1} \gamma^t \bar{r}(s_t, a_t) + \gamma^k V^\mu(s_k) \,\middle|\, s_0 = s, o \right]$$

where $\mu$ is the high-level (option-selection) policy (Lee et al., 2022). A minimal tabular update implementing this recursion is sketched after the list.

  • Policy-Gradient and Auxiliary Objectives: In multitier setups, policy gradient updates for both high- and low-level policies are possible, with joint or separate objectives. Advantage-based auxiliary rewards can shape low-level learning and guarantee monotonic improvement in hierarchical return (Li et al., 2019).
  • Hierarchical Backups and Credit Assignment: Hierarchical value estimation naturally results in skip-connections and nonlocal credit assignment, formalized as skip-step TD errors, hierarchical n-step returns, or eligibility traces (e.g., HierQ$_k(\lambda)$), which can drastically reduce variance and improve credit propagation (Vries et al., 2022).
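
The SMDP recursion above corresponds to the standard SMDP Q-learning update, in which the reward collected while an option runs for $k$ steps is discounted by $\gamma^k$ at the bootstrap. The tabular sketch below assumes a Gym-style environment with hashable states and options following the `Option` interface from Section 1; all interfaces and hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

def smdp_q_learning(env, options, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular SMDP Q-learning over a fixed set of options.

    Implements Q(s, o) <- Q(s, o) + alpha * [R + gamma^k * max_o' Q(s', o') - Q(s, o)],
    where R is the gamma-discounted reward accumulated during the option's k steps.
    Assumes env.reset()/env.step(a) (Gym-style, 4-tuple step), hashable states, and
    that at least one option is admissible in every visited state.
    """
    Q = defaultdict(float)  # keyed by (state, option index)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            admissible = [i for i, o in enumerate(options) if o.initiation(s)]
            i = (random.choice(admissible) if random.random() < eps
                 else max(admissible, key=lambda j: Q[(s, j)]))
            option = options[i]
            # Call-and-return execution: run the intra-option policy until beta fires.
            R, k, s_next = 0.0, 0, s
            while True:
                a = option.policy(s_next)
                s_next, r, done, _ = env.step(a)
                R += (gamma ** k) * r
                k += 1
                if done or random.random() < option.termination(s_next):
                    break
            # Bootstrap over all options for simplicity (a sketch-level choice).
            bootstrap = 0.0 if done else max(Q[(s_next, j)] for j in range(len(options)))
            Q[(s, i)] += alpha * (R + (gamma ** k) * bootstrap - Q[(s, i)])
            s = s_next
    return Q
```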

4. Reward Shaping, Intrinsic Rewards, and Sample Efficiency

HRL formalisms routinely deploy intrinsic rewards or shaped objectives to guide option policies and stabilize the learning of subpolicies:

  • Intrinsic Rewards from Abstract Frames: The PaRL approach defines augmented rewards penalizing violations of planning operator “frames”—the subset of variables that should remain invariant or match the operator's effect—so that low-level policies better implement symbolic transitions (Lee et al., 2022).
  • Manager-Worker (Feudal) Rewards: Hierarchies with explicit manager-agent decoupling employ modules that issue subgoals, with the worker rewarded for reaching subgoals and the manager rewarded for environment-level progress, often using different temporal and spatial abstraction granularities (Johnson et al., 2023).
  • Timed Subgoals and Stationarity: HRL methods based on timed subgoals (e.g., HiTS) enforce that the high-level policy interacts with the environment at regular intervals, stabilizing replay buffer distributions and enabling reliable off-policy learning even in dynamic environments (Gürtler et al., 2021).
  • Automata-Driven Rewards: The progress of an automaton or reward machine is used as a source of dense intrinsic rewards, turning logical task structure into stepwise learning signals for skill acquisition (Li et al., 2017, Furelos-Blanco et al., 2022).

Such mechanisms support stronger supervision and targeted exploration, yielding increased sample efficiency and facilitating zero-shot transfer or skill composition.
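
As a concrete instance of the automaton-driven shaping described above, the fragment below sketches a minimal reward machine whose control state advances on labelled events and which emits a fixed intrinsic reward whenever progress is made. The transition table, labels, and reward value are invented for illustration and are not taken from the cited papers.

```python
class RewardMachine:
    """Minimal reward machine: the control state advances on labelled events,
    and any progress yields a fixed intrinsic reward (a common dense-shaping choice)."""

    def __init__(self, transitions, initial, accepting, progress_reward=1.0):
        self.transitions = transitions      # dict: (q, label) -> q'
        self.q = initial                    # current automaton / control state
        self.accepting = accepting          # set of accepting states (task complete)
        self.progress_reward = progress_reward

    def step(self, labels):
        """Advance on the labels true in the current environment state.
        Returns (intrinsic_reward, task_done)."""
        for label in labels:
            q_next = self.transitions.get((self.q, label))
            if q_next is not None and q_next != self.q:
                self.q = q_next
                return self.progress_reward, self.q in self.accepting
        return 0.0, self.q in self.accepting

# Illustrative task "pick up the key, then open the door": q0 --key--> q1 --door--> q2.
rm = RewardMachine(
    transitions={("q0", "key"): "q1", ("q1", "door"): "q2"},
    initial="q0",
    accepting={"q2"},
)
# The agent learns on the product state (s, rm.q); rewards from rm.step(labeller(s))
# densify the otherwise sparse environment signal.
```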

5. Optimality, State Abstraction, and Theoretical Guarantees

Recent HRL theory provides explicit suboptimality, convergence, and sample-complexity guarantees under various settings:

  • Near-Optimality via Realizable Abstraction: If the abstraction is realizable—i.e., every high-level transition can be implemented by an appropriately constructed option—the composed low-level policy can be boundedly suboptimal in $M$:

$$V^*(s) - V^{\pi_L}(s) \leq \frac{\epsilon_R + |S_H|\, \epsilon_T}{(1-\gamma)^2 (1 - \gamma_H)}$$

for option/value errors $(\epsilon_R, \epsilon_T)$ and a finite abstract decision process $M_H$ (Cipollone et al., 4 Dec 2025); a small numerical reading of this bound follows the list.

  • State-Space Reduction: Restriction to option “frames” or relevant state features can dramatically reduce the dimensionality of the subtask MDPs, supporting both analytical tractability and practical transfer (Lee et al., 2022, Jonsson et al., 2016).
  • Sample Complexity and PAC Bounds: Provided CMDP option synthesis admits PAC-safe solutions, the sample complexity for learning a hierarchical policy via constrained MDPs is $O\!\left( |S_H|^2 |A_H|\, r(\epsilon_{\text{realizer}}) \right)$, with convergence to an $\epsilon$-optimal ground policy with high probability (Cipollone et al., 4 Dec 2025).
  • Regret Bounds in Meta-Hierarchy Discovery: In meta-RL settings, it is possible to learn a latent optimal hierarchy online with polynomial sample complexity, and to provably leverage the identified hierarchy for significant regret reduction in downstream tasks (Chua et al., 2021).
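
As a quick numerical reading of the realizable-abstraction bound above, the snippet below evaluates its right-hand side for a few error levels. The specific values of $\epsilon_R$, $\epsilon_T$, $|S_H|$, $\gamma$, and $\gamma_H$ are arbitrary and serve only to show how strongly the $(1-\gamma)^{-2}$ factor amplifies the option errors.

```python
def suboptimality_bound(eps_R, eps_T, n_abstract_states, gamma, gamma_H):
    """Upper bound on V*(s) - V^{pi_L}(s) under a realizable abstraction:
    (eps_R + |S_H| * eps_T) / ((1 - gamma)^2 * (1 - gamma_H))."""
    return (eps_R + n_abstract_states * eps_T) / ((1 - gamma) ** 2 * (1 - gamma_H))

# Hypothetical setting: 20 abstract states, gamma = 0.95, gamma_H = 0.8.
for eps in (1e-2, 1e-3, 1e-4):
    print(f"eps_R = eps_T = {eps:.0e} -> bound = "
          f"{suboptimality_bound(eps, eps, 20, 0.95, 0.8):.2f}")
# Prints bounds of roughly 420, 42, and 4.2, illustrating that tight option
# errors are needed before the guarantee becomes practically meaningful.
```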

A recurring theme is the necessity of enforcing Markovity at the abstract level, frequently achieved via entry-exit-aware option design and second-order transition kernels.

6. Algorithmic Schemes and Practical Implementations

HRL algorithms realize the above formalism in structured workflows:

  • Call-and-Return with Online Re-Planning: The agent computes high-level plans (e.g., via a classical planner or value iteration) to generate option sequences, executing and learning low-level option policies using rollout-generated experience, with intrinsic reward computation and replay buffer population (Lee et al., 2022).
  • Alternating Planning and Option Learning: Alternating AVI (A-AVI) and robust AVI (R-AVI) interleave option policy learning (reachability-based RL in subgoal-induced MDPs) with abstract planning (high-level AVI), updating the state-distribution to mitigate the distributional shift arising from non-Markovity (Jothimurugan et al., 2020).
  • Bottom-Up Hierarchical Updates: Tabular approaches such as HQI iteratively solve subtasks in bottom-to-root order, propagating value information, allowing for off-policy learning from fixed datasets and flexible state abstraction (Zhao et al., 2016).
  • Automata/Reward-Machine Option Induction: Automata-guided HRL constructs options from automata edges or reward-machine calls, with intrinsic rewards and Q-learning tailored for each automaton or option context, and skill composition via automaton product and Q-function initialization (Li et al., 2017, Furelos-Blanco et al., 2022).
  • Feudal/Manager-Worker Systems: Multi-level architectures use a manager for high-level goal selection at coarse temporal or spatial scale, and workers executing fine-grained policies towards subgoals over fixed or adaptive intervals, with decoupled learning signals and buffers (Johnson et al., 2023, Gürtler et al., 2021).

These algorithmic patterns are tailored for sample efficiency, robustness to nonstationarity, symbolically grounded RL, and reliable credit assignment.
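
The call-and-return scheme with online re-planning described in the first item above can be summarized as a short control loop: plan at the abstract level, execute the corresponding option to completion, feed the shaped low-level transitions to an off-policy learner, and re-plan. Every interface here (`planner`, `intrinsic_reward`, `learner`, the operator-indexed `options` map) is a hypothetical skeleton meant only to show the data flow, not the algorithm of any single cited paper.

```python
import random

def hierarchical_episode(env, abstraction, planner, options, learner, intrinsic_reward):
    """One episode of call-and-return HRL with online re-planning (sketch).

    Assumed interfaces (hypothetical, for illustration only):
      planner(abstract_state)        -> list of operator names (symbolic plan)
      options[op]                    -> Option with initiation/policy/termination
      intrinsic_reward(op, s, s_new) -> shaped reward (e.g., frame-violation penalty)
      learner.update(...)            -> off-policy update / replay-buffer insertion
    """
    s, done = env.reset(), False
    while not done:
        plan = planner(abstraction(s))        # (re-)plan from the current abstract state
        if not plan:
            break
        op = plan[0]                          # execute the first operator, then re-plan
        option = options[op]
        if not option.initiation(s):
            break                             # abstraction mismatch; bail out in this sketch
        while not done:                       # call-and-return: run the option to completion
            a = option.policy(s)
            s_next, r, done, _ = env.step(a)
            learner.update(s, a, r + intrinsic_reward(op, s, s_next), s_next, done)
            s = s_next
            if random.random() < option.termination(s):
                break
    return learner
```

In practice the loop would also log option-level (SMDP) transitions for the high-level learner or planner model; the sketch omits this to stay minimal.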

7. Implications, Strengths, and Open Challenges

HRL formalism, as developed across recent literature, offers:

  • Sample Efficiency and Exploration: Temporal abstraction decomposes the exploration problem, reduces effective horizon, and enables focused exploration within local subspaces, improving convergence times in long-horizon and sparse-reward settings (Lee et al., 2022, Jothimurugan et al., 2020, Jonsson et al., 2016).
  • Modularity and Transfer: Options and hierarchical skills, once learned, can be recomposed for novel tasks, supporting transfer and zero-shot learning (Zhao et al., 2016, Furelos-Blanco et al., 2022).
  • Symbolic Knowledge Integration: By anchoring subgoals and transitions in interpretable planning models, automata, or logical formulas, HRL can leverage external knowledge and support human interpretability (Lee et al., 2022, Li et al., 2017).
  • Addressing Non-Markovity: Second-order abstractions and entry-exit-aware option design resolve previously intractable non-Markovian artifacts in ADPs, enabling planning at the abstract level with rigorous guarantees (Cipollone et al., 4 Dec 2025).
  • Skill Composition and Task Flexibility: Automata-based and reward-machine HRL formalism support efficient composition or modification of task specifications, enabling rapid adaptation without retraining from scratch (Li et al., 2017, Furelos-Blanco et al., 2022).

A principal open challenge remains the construction or learning of “good” abstractions and subgoal structures in environments that lack clean bottlenecks or planning models, together with scalable real-time synthesis of options in high-dimensional, stochastic, or partially observable domains. Integration of PAC-optimal, robust abstraction learning (Cipollone et al., 4 Dec 2025, Chua et al., 2021) with advanced function approximation remains an active area of research.

