Hierarchical RL with Analytical Expressions

Updated 21 November 2025
  • Hierarchical RL with Analytical Expressions is a framework that integrates linearly-solvable Markov decision processes (LMDPs) and hierarchical decomposition to enable efficient credit assignment and global policy optimality.
  • It leverages closed-form analytical solutions via the Cole–Hopf transform and compositional methods to reduce computational and sample complexity.
  • Empirical evaluations demonstrate faster convergence and superior generalization on structured tasks compared to traditional flat reinforcement learning methods.

Hierarchical Reinforcement Learning (HRL) with Analytical Expressions centers on the intersection of hierarchical decomposition in RL and analytically tractable reinforcement learning formulations such as linearly-solvable Markov decision processes (LMDPs) and symbolic variable-slot structures. This approach enables efficient credit assignment, compositionality, globally optimal policy computation, and generalization across domains and tasks with rigorous analytical guarantees. The framework encompasses advances in LMDPs and their hierarchical extensions, analytical decomposability for skill composition, and neural architectures that exploit explicit analytical structure for compositional generalization.

1. Analytical Foundations: Linearly-Solvable MDPs and Analytical Structures

A key analytical foundation is the class of LMDPs. An LMDP operates on a state space $S$ with passive (uncontrolled) transition dynamics $P(s'|s)$ and a state-dependent reward $R(s) \leq 0$, which keeps the exponentiated quantities bounded. The core control variable is a next-state distribution $a(\cdot|s)$, subject to absolute continuity with respect to $P$ and normalization, with instantaneous reward

$$\mathcal{R}(s, a) = R(s) - \lambda \, \mathrm{KL}\big(a(\cdot|s) \,\|\, P(\cdot|s)\big),$$

where $\lambda > 0$ encodes the trade-off between reward and control cost.

Applying the Cole–Hopf transform $Z(s) = \exp(V(s)/\lambda)$ to the Bellman equation yields a linear eigenproblem:

$$Z(s) = e^{R(s)/\lambda} \sum_{s'} P(s'|s)\, Z(s'),$$

which admits an analytical solution via power iteration or standard linear solvers, ensuring efficient and globally optimal value computation. The associated optimal policy is:

$$a^*(s'|s) = \frac{P(s'|s)\, Z(s')}{\sum_{y} P(y|s)\, Z(y)}.$$

This framework is extensible to compositional structures, variable-slot symbolic expressions, and global policy synthesis in HRL (Jonsson et al., 2016, Infante et al., 2021, Liu et al., 2020).
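
For concreteness, the following is a minimal sketch (not taken from the cited papers) of how the linear eigenproblem and the policy extraction formula above can be implemented. It treats $Z$ as the principal eigenvector of $\mathrm{diag}(e^{R/\lambda})\,P$ computed by power iteration; all names, shapes, and defaults are illustrative assumptions.

```python
# Illustrative sketch only: tabular LMDP solved via the Cole-Hopf transformed
# Bellman equation, Z(s) = exp(R(s)/lambda) * sum_s' P(s'|s) Z(s').
import numpy as np

def solve_lmdp(P, R, lam=1.0, iters=10_000, tol=1e-10):
    """Power iteration on the linear eigenproblem.

    P   : (n, n) passive transition matrix, rows P(.|s) summing to 1
    R   : (n,) state rewards with R(s) <= 0
    lam : temperature lambda > 0 trading reward against control cost
    """
    w = np.exp(R / lam)                    # exponentiated state rewards
    Z = np.ones(len(R))
    for _ in range(iters):
        Z_new = w * (P @ Z)                # linear Bellman backup
        Z_new /= np.linalg.norm(Z_new)     # normalise: principal eigenvector
        if np.linalg.norm(Z_new - Z) < tol:
            return Z_new
        Z = Z_new
    return Z

def optimal_policy(P, Z):
    """a*(s'|s) = P(s'|s) Z(s') / sum_y P(y|s) Z(y)."""
    A = P * Z[None, :]                     # reweight passive dynamics by Z
    return A / A.sum(axis=1, keepdims=True)
```

In a first-exit formulation, terminal states would instead be clamped to $Z(s) = e^{R(s)/\lambda}$ and the normalization dropped; the policy extraction formula is unchanged.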

2. Hierarchical Decomposition and Task Structure

Hierarchical decomposition in the LMDP setting involves representing an MDP $L = \langle S, P, R \rangle$ as a collection of tasks $\{L_0, \ldots, L_n\}$ arranged in a MAXQ-like hierarchy. Each subtask $L_i$ comprises a termination set $T_i \subset S$, available subtask actions $A_i$, and termination pseudo-rewards. For each non-terminal state $s$, the constructed subtask-specific transition kernel $P_i$ and reward $R_i$ encode (a) primitive transitions and (b) subtask-invoking actions, usually parameterized via the value functions of lower-level subtasks. When subtasks admit multiple termination states, compositionality is handled using auxiliary tasks, with closed-form expressions for value and termination probability (Jonsson et al., 2016, Infante et al., 2021).

Alternatively, Infante et al. (2021) formalize HRL for LMDPs by partitioning the state space $S = \bigsqcup_{i=1}^{L} S_i$ and associating to each block $S_i$ a sub-LMDP, inducing terminal states at the inter-partition boundaries and leveraging the compositionality principle: for each subtask, precomputed solutions for distinct boundary/terminal configurations permit the desirability function $z_i(s)$ on $S_i$ to be expressed as a linear combination of these bases, parametrized by the global exit desirabilities. This decomposition reduces both computational and sample complexity by focusing computation on smaller subregions and boundary states.
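
A minimal sketch of one way to hold this decomposition in memory is given below; the container and field names are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch: one partition block S_i with its precomputed base
# desirability vectors z_i^k, so that z_i(s) is a linear combination of the
# bases weighted by the global exit desirabilities z_E.
from dataclasses import dataclass
import numpy as np

@dataclass
class SubLMDP:
    interior: np.ndarray   # global indices of the interior states of block S_i
    exits: np.ndarray      # global indices of this block's boundary/terminal states
    P: np.ndarray          # passive dynamics restricted to interior + exit states
    R: np.ndarray          # state rewards on the same restriction
    bases: np.ndarray      # (n_interior, n_exits); column k is z_i^k, the base
                           # solution with desirability 1 at exit b_k, 0 elsewhere

    def desirability(self, z_E_local: np.ndarray) -> np.ndarray:
        """z_i(s) = sum_k z_E(b_k) z_i^k(s) on the block's interior states."""
        return self.bases @ z_E_local
```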

3. Compositionality and Analytical Policy Synthesis

The cornerstone of hierarchical RL with analytical expressions is compositionality: linear solvability enables solutions for new combinations of terminal rewards and boundary states to be synthesized analytically from pre-solved base LMDPs. In the HRL-LMDP framework (Infante et al., 2021), all global value and policy information propagates via a compact set of exit states $E = \bigcup_i T_i$, resulting in a small eigenproblem

$$z_E = G\, z_E,$$

where $G$ is constructed from precomputed subtask bases. Thus, the global value $z$ for all $s \in S$ is given analytically by

$$z(s) = \sum_{k=1}^{n(i)} z_E(b_k)\, z_i^k(s) \quad \text{if } s \in S_i,$$

and the policy at any state by the optimal extraction formula

$$\pi^*(s'|s) = \frac{P(s'|s)\, z(s')}{\sum_u P(u|s)\, z(u)}.$$

For subtasks or partitions with equivalent internal dynamics and rewards, symmetries can be exploited to share solutions, further reducing computational overhead. This approach provably yields the globally optimal flat policy without the nonstationarity issues of conventional option-based or separate high-level policies (Infante et al., 2021).
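
Under assumed array layouts (illustrative names, not the published implementation), the resulting computation reduces to a small power iteration on $G$ followed by a broadcast of $z_E$ through each block's bases:

```python
# Illustrative sketch: solve z_E = G z_E on exit states, then compose the
# global desirability z from precomputed per-block bases.
import numpy as np

def solve_exit_desirabilities(G, iters=10_000, tol=1e-12):
    """Power iteration for the principal solution of z_E = G z_E (G is small)."""
    z_E = np.ones(G.shape[0])
    for _ in range(iters):
        z_new = G @ z_E
        z_new /= np.linalg.norm(z_new)           # keep the iterate bounded
        if np.linalg.norm(z_new - z_E) < tol:
            return z_new
        z_E = z_new
    return z_E

def compose_global_z(n_states, blocks, exit_index, z_E):
    """z(s) = sum_k z_E(b_k) z_i^k(s) for s in S_i.

    blocks     : iterable of (interior_idx, exit_idx, bases), where
                 bases[j, k] = z_i^k(interior_idx[j]) for exit b_{exit_idx[k]}
    exit_index : global state indices of the exit states, aligned with z_E
    """
    z = np.zeros(n_states)
    z[exit_index] = z_E                          # exit states carry z_E directly
    for interior_idx, exit_idx, bases in blocks:
        z[interior_idx] = bases @ z_E[exit_idx]  # linear combination of bases
    return z
```

The flat optimal policy then follows by applying the same extraction formula as in Section 1 to the composed $z$.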

4. Analytical HRL with Symbolic and Memory-Augmented Models

Beyond LMDP-specific hierarchies, analytical compositionality is exploited in memory-augmented neural models for RL-driven compositional generalization. The LAnE model (Liu et al., 2020) employs a two-module neural architecture: a high-level "Composer" agent selects spans of source expressions as subgoals, and a low-level "Solver" agent translates these into target expressions. The process is governed by end-to-end hierarchical policy gradients, where both the selection of sub-expressions and their translation are learned jointly through REINFORCE, shaped by trajectory-level rewards that combine sequence similarity and analytical simplicity. With analytical policies and variable-slot abstraction, LAnE demonstrates perfect generalization on the SCAN systematicity and compositionality splits, exceeding conventional sequence-model architectures.
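
As a rough, non-authoritative illustration of the training signal (not the LAnE implementation), the sketch below performs one trajectory-level REINFORCE update that couples the two modules; `composer`, `solver`, and `reward_fn` are hypothetical stand-ins for the paper's Composer, Solver, and shaped reward, and spans are assumed to be consumed left to right.

```python
# Hypothetical sketch of a joint hierarchical REINFORCE step; the module
# interfaces (sample_span, sample_translation) are assumed, not from the paper.
import torch

def hierarchical_reinforce_step(composer, solver, src_tokens, tgt_tokens,
                                reward_fn, optimizer, baseline=0.0):
    log_probs, outputs = [], []
    remaining = list(src_tokens)
    while remaining:
        span, span_logp = composer.sample_span(remaining)    # high-level subgoal
        trans, trans_logp = solver.sample_translation(span)  # low-level solution
        log_probs.append(span_logp + trans_logp)
        outputs.extend(trans)
        remaining = remaining[len(span):]                    # consume left to right
    reward = reward_fn(outputs, tgt_tokens)                  # trajectory-level reward
    loss = -(reward - baseline) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```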

5. Online Algorithms and Convergence Properties

Analytical HRL frameworks admit efficient online learning via Z-learning (for LMDPs) and hierarchical policy gradient. In hierarchical Z-learning (Jonsson et al., 2016), updates of the form

$$\hat Z_i(s_t) \leftarrow (1-\alpha)\, \hat Z_i(s_t) + \alpha\, e^{r_t/\lambda}\, \hat Z_i(s_{t+1}) \left[ \frac{P_i(s_{t+1}|s_t)}{\hat a_i(s_{t+1}|s_t)} \right]$$

are applied simultaneously to all tasks for which $s_t \notin T_i$ (intra-task sharing), accelerating convergence and maintaining unbiasedness due to linearity. Convergence is guaranteed provided each task's state graph is aperiodic and irreducible, under standard Robbins–Monro conditions on step sizes and sufficient exploration.
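
Concretely, a single transition sample can update every active task's estimate at once. The sketch below assumes tabular estimates stored as per-task arrays; the dictionary layout and names are illustrative, not from the paper.

```python
# Illustrative sketch of one hierarchical Z-learning step with intra-task sharing:
# the same sample (s, r, s') updates every task i with s not in T_i, using the
# importance ratio P_i(s'|s) / a_i(s'|s) because actions follow the current
# estimated policy rather than the passive dynamics.
import numpy as np

def z_learning_step(Z_hat, P, a_hat, terminal_sets, s, s_next, r, lam, alpha):
    """Z_hat, P, a_hat: dicts task -> per-task arrays; terminal_sets: dict task -> set."""
    for i, Z_i in Z_hat.items():
        if s in terminal_sets[i]:
            continue                                    # task i is terminated at s
        rho = P[i][s, s_next] / max(a_hat[i][s, s_next], 1e-12)
        Z_i[s] = (1 - alpha) * Z_i[s] + alpha * np.exp(r / lam) * Z_i[s_next] * rho
    return Z_hat
```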

The partitioned HRL-LMDP algorithm (Infante et al., 2021) further reduces online sample complexity: only a small number of updates per region and boundary state are needed, and the propagation of learned desirabilities is computationally and sample-wise efficient whenever the block size and total number of unique boundary states are much smaller than the full state space.

In neural-symbolic HRL with analytical expressions (Liu et al., 2020), end-to-end training with curriculum learning is essential to overcome sparse and delayed rewards typical of compositional sequence tasks; both high-level and low-level modules are optimized jointly via trajectory-level policy gradient estimates, leveraging the analytical structure embedded in the memory architecture.

6. Empirical Evaluation and Performance Metrics

Extensive empirical testing validates the advantages of hierarchical RL with analytical expressions. In LMDP-based hierarchies (Jonsson et al., 2016), Z-learning with intra-task updates achieves an order-of-magnitude faster convergence in value-estimation error and throughput metrics in the taxi and AGV (autonomous guided vehicle) domains, compared to MAXQ-Q and flat Q-learning variants. The partitioned HRL-LMDP approach of Infante et al. (2021) demonstrates that, for domain decompositions with small boundary sets, the algorithm reaches the globally optimal policy at a fraction of the iteration and sample complexity of flat solvers.

Models exploiting analytical expression compositionality, such as LAnE (Liu et al., 2020), reach 100% accuracy on SCAN-style benchmarks, including challenging systematicity and productivity splits, and sustain performance with minimal supervision or extremely limited training data, outperforming standard neural baseline models.

7. Theoretical Guarantees and Scope of Applicability

The analytical tractability and compositionality principles of LMDPs and symbolic variable-slot models ensure provable global optimality and efficient policy extraction in hierarchical settings, provided the RL problem admits either linearly-solvable dynamics or compositional symbolic structure. The spectrum of applicability covers discrete grid worlds, structured navigation tasks, large but factorizable state spaces, and structured symbolic translation tasks where analytical policies can be extracted and composed. Bounded sample complexity and convergence rates are achievable by exploiting problem structure—state space partitions with small boundaries, analytical policy forms, and symbolic decomposition—without resorting to statistically brittle or computationally intensive approximations.

The limitations are inherent to expressibility: HRL with analytical expressions requires either linearly-solvable structure (e.g., LMDPs) or symbolic compositionality (e.g., variable-slot models), restricting generalization to classes of domains where such formalization is possible and tractable (Jonsson et al., 2016, Infante et al., 2021, Liu et al., 2020).
