HIQL: Hierarchical Implicit Q-Learning

Updated 6 April 2026

HIQL is a hierarchical RL framework that decomposes learning into high-level subgoal prediction and low-level policy execution.
It utilizes an action-free value function to reduce approximation noise and improve offline goal-conditioned and inverse RL performance.
Extensions like normalizing flows and physics-informed regularization enhance its expressivity and empirical success in long-horizon tasks.

Hierarchical Implicit Q-Learning (HIQL) refers to a family of hierarchical reinforcement learning (RL) and inverse RL algorithms unified by a two-level decomposition into subgoal prediction and subgoal-conditioned policy optimization. In HIQL for offline goal-conditioned RL, the high-level agent proposes a future state (or its learned latent) as a subgoal, and the low-level agent executes primitive actions to reach it; both policies derive from a single “action-free” value function, sidestepping the need for explicit action maximization and enabling robust, scalable use of offline, potentially action-free, datasets. Separately, HIQL for inverse RL (multi-intention behavior modeling) applies a similar segmentation to demonstrations, modeling behavior as arising from switching between discrete latent intentions, each with its own reward and value function inferred by an EM-based procedure. Ongoing variants, such as flow-based parameterizations or physics-informed (PDE-regularized) learning, extend the expressivity and applicability of this class of methods.

1. Problem Setting, Formulation, and Motivation

In offline goal-conditioned RL, the agent operates in a finite-horizon discounted MDP, $M=(\mathcal S, \mathcal A, p, \mu, \gamma)$, with a goal space $\mathcal G = \mathcal S$, episodic sampling of the initial state and goal, and sparse reward $r(s, g) = 0$ if $s=g$ and $-1$ otherwise. The aim is to learn a goal-conditioned policy $\pi(a|s,g)$ maximizing the expected discounted return: $J(\pi) = \mathbb{E}_{g\sim p(g),\,\tau\sim p^\pi}\left[\sum_t \gamma^t r(s_t,g)\right]$. Crucially, only an offline dataset of trajectories is available, consisting of action-annotated $(s_t, a_t, s_{t+1})$ tuples, optionally augmented by additional action-free $(s_t, s_{t+1})$ pairs (Park et al., 2023).
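To make the sparse reward and the return concrete, the following minimal sketch evaluates the discounted return of one trajectory. The 1D integer states, the function names, and the choice of discount are illustrative assumptions, not part of the original formulation.

```python
def sparse_reward(state, goal):
    """Sparse goal-reaching reward: 0 at the goal, -1 everywhere else."""
    return 0.0 if state == goal else -1.0


def discounted_return(states, goal, gamma):
    """Sum over t of gamma**t * r(s_t, g) for the visited states."""
    return sum((gamma ** t) * sparse_reward(s, goal) for t, s in enumerate(states))


# Hypothetical 1D trajectory that reaches goal 3 at t = 3 and stays there.
trajectory = [0, 1, 2, 3, 3]
ret = discounted_return(trajectory, goal=3, gamma=0.5)
print(ret)  # -1.75, i.e. -(1 + 0.5 + 0.25)
```

Because every step before reaching the goal costs -1, maximizing this return amounts to reaching the goal in as few steps as possible.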

Directly estimating the value of reaching distant goals is subject to dramatic value approximation noise, inhibiting the efficacy of flat (non-hierarchical) RL approaches, especially in long-horizon or high-dimensional scenarios. Inverse RL settings introduce the need to infer intention- or context-dependent reward segments from demonstrations, under the hypothesis that behavior is better described as a sequence of latent discrete intentions, rather than a single temporally stationary motive (Zhu et al., 2023).

HIQL addresses these structural and statistical challenges through temporal abstraction (hierarchy over subgoals) and, in the inverse setting, intention segmentation.

2. Algorithmic Structure and Hierarchical Decomposition

The essential algorithmic contribution of HIQL in goal-conditioned RL is its decomposition:

High-Level Policy ( $\pi^H$ ): Every $k$ steps, produces a subgoal $s_{t+k}$ or a low-dimensional encoding $\phi(s_{t+k})$ of it. The high-level “action” space is the (latent) state space.
Low-Level Policy ( $\pi^L$ ): At each step, attempts to reach the current subgoal $s_{t+k}$, mapping current state and subgoal to actions.
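The interaction between the two levels at execution time can be sketched as a simple control loop. Both policies below are hand-written stand-ins on a 1D chain (hypothetical, not learned networks), and the subgoal interval is called `k` here as an assumed name.

```python
def high_level_policy(state, goal, k):
    """Hypothetical stand-in for the high-level policy: a subgoal at most k steps toward the goal."""
    step = max(-k, min(k, goal - state))
    return state + step


def low_level_policy(state, subgoal):
    """Hypothetical stand-in for the low-level policy: a primitive action in {-1, 0, +1}."""
    return (subgoal > state) - (subgoal < state)


def rollout(start, goal, k, max_steps=100):
    """Run the two-level controller and return the visited states."""
    state, subgoal, path = start, start, [start]
    for t in range(max_steps):
        if state == goal:
            break
        if t % k == 0:  # the high level re-plans the subgoal every k steps
            subgoal = high_level_policy(state, goal, k)
        state += low_level_policy(state, subgoal)  # the low level acts at every step
        path.append(state)
    return path


path = rollout(start=0, goal=7, k=3)
print(path)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The high level only ever reasons about states, which is why it can be trained without action labels; only the low level maps to primitive actions.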

This leads to a three-phase training protocol (Park et al., 2023, Giammarino et al., 8 Sep 2025):

Value Function Learning: Train a single goal-conditioned value network $V(s, g)$ via expectile regression (action-free IQL).
High-Level Policy Learning: Optimize $\pi^H(s_{t+k} \mid s_t, g)$ via advantage-weighted regression (AWR) over $k$-step state pairs, using action-free data.
Low-Level Policy Learning: Optimize $\pi^L(a_t \mid s_t, s_{t+k})$ via AWR on $(s_t, a_t, s_{t+1}, s_{t+k})$ tuples, requiring action labels.

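The low-level policy “implements” each high-level subgoal; both leverage the same value function, but with different advantage definitions: $A^H(s_t, s_{t+k}, g) = V(s_{t+k}, g) - V(s_t, g)$ for the high level and $A^L(s_t, a_t, s_{t+k}) = V(s_{t+1}, s_{t+k}) - V(s_t, s_{t+k})$ for the low level. The statistical rationale is that value differences with respect to nearby goals (subgoals) are larger relative to the approximation noise, and hence yield more robust learning signals.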
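As an illustration of how one value function can serve both levels, the sketch below computes a high-level and a low-level advantage from the same toy value. It assumes the high-level advantage is the value gained toward the final goal by moving to the subgoal, and the low-level advantage is the value gained toward the subgoal by one primitive step; the toy value (negative step count on a 1D chain) and all names are hypothetical.

```python
def value(state, goal):
    """Toy stand-in for a goal-conditioned value: negative number of steps on a 1D chain."""
    return -abs(goal - state)


def high_level_advantage(s_t, s_tk, goal):
    """Progress toward the final goal made by moving from s_t to the subgoal s_tk."""
    return value(s_tk, goal) - value(s_t, goal)


def low_level_advantage(s_t, s_next, s_tk):
    """Progress toward the subgoal s_tk made by one primitive step from s_t to s_next."""
    return value(s_next, s_tk) - value(s_t, s_tk)


a_high = high_level_advantage(s_t=0, s_tk=5, goal=50)  # -45 - (-50) = 5
a_low = low_level_advantage(s_t=0, s_next=1, s_tk=5)   # -4 - (-5) = 1
print(a_high, a_low)  # 5 1
```

Note that the low-level advantage never queries the distant goal at 50, so any estimation error that grows with distance to the goal does not enter the low-level learning signal.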

Experimental results demonstrate that this hierarchy enables substantial gains in long-horizon performance, state-space coverage, and data efficiency over flat baselines (see Section 7).

3. Mathematical Objectives and Policy Extraction

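The value-learning phase minimizes the expectile Bellman error: $\mathcal L_V = \mathbb{E}_{(s, s', g)}\left[L_2^\tau\big(r(s, g) + \gamma \bar V(s', g) - V(s, g)\big)\right]$ with expectile loss $L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$, and a slowly-updated target network $\bar V$.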
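A minimal sketch of an expectile-regression value loss follows, written in plain Python rather than a deep-learning framework. It assumes the asymmetric squared loss weights positive temporal-difference errors by an expectile parameter (called `tau` here) and negative ones by `1 - tau`; the masking of the bootstrap term at the goal, the toy value function, and the batch layout are additional assumptions made for illustration.

```python
def expectile_loss(u, tau):
    """Asymmetric squared loss: weight tau for u >= 0 and (1 - tau) for u < 0."""
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def value_loss(batch, value, target_value, gamma, tau):
    """Mean expectile loss of the action-free TD error over (s, s_next, g) triples."""
    total = 0.0
    for s, s_next, g in batch:
        reached = (s == g)
        reward = 0.0 if reached else -1.0
        mask = 0.0 if reached else 1.0  # assumption: no bootstrapping once the goal is reached
        td_error = reward + gamma * mask * target_value(s_next, g) - value(s, g)
        total += expectile_loss(td_error, tau)
    return total / len(batch)


def chain_value(s, g):
    """Toy value on a 1D chain: negative number of steps to the goal."""
    return -float(abs(g - s))


batch = [(0, 1, 3), (1, 2, 3), (3, 3, 3)]
loss = value_loss(batch, chain_value, chain_value, gamma=1.0, tau=0.7)
print(loss)  # 0.0: the toy value is already consistent with these transitions
```

With `tau` above 0.5, transitions whose target exceeds the current estimate are weighted more heavily, which biases the value toward the better outcomes present in the data without ever maximizing over actions.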

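Policy extraction is performed using AWR objectives, which for both levels take the form: $J(\pi^H) = \mathbb{E}\left[\exp\big(\beta\, A^H(s_t, s_{t+k}, g)\big) \log \pi^H(s_{t+k} \mid s_t, g)\right]$ and $J(\pi^L) = \mathbb{E}\left[\exp\big(\beta\, A^L(s_t, a_t, s_{t+k})\big) \log \pi^L(a_t \mid s_t, s_{t+k})\right]$, where $A^H$ and $A^L$ are the high- and low-level advantages. Here, $\beta$ is an inverse temperature parameter that weights higher-advantage actions or subgoals more heavily.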
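The weighting used in advantage-weighted regression can be sketched in a few lines. The inverse temperature is called `beta` here, and the cap on the weights (`max_weight`) is a common practical safeguard that is assumed for illustration rather than taken from the text.

```python
import math


def awr_weight(advantage, beta, max_weight=100.0):
    """exp(beta * advantage), capped at max_weight (the cap is an assumption)."""
    # Clip the exponent before exponentiating to avoid floating-point overflow.
    return math.exp(min(beta * advantage, math.log(max_weight)))


def awr_loss(log_probs, advantages, beta):
    """Negative advantage-weighted log-likelihood, averaged over the batch."""
    weights = [awr_weight(a, beta) for a in advantages]
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Hypothetical batch: log-probabilities the policy assigns to dataset targets,
# and the advantages of those targets under the learned value function.
log_probs = [-1.0, -1.0]
advantages = [1.0, -1.0]
loss = awr_loss(log_probs, advantages, beta=1.0)
print(loss)
```

The same routine applies at both levels: for the high level the targets are dataset subgoals, for the low level they are dataset actions. Minimizing this loss is equivalent to maximizing the weighted log-likelihood objective.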

A key theoretical result quantifies error propagation in a toy 1D grid, showing that hierarchical decomposition can strictly reduce the probability of taking the wrong action compared to a flat policy, for a suitable subgoal horizon $k$ (Park et al., 2023).
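The intuition behind this result can be illustrated numerically. The sketch below is not the cited analysis: it assumes a 1D chain where the true value is the negative distance to the goal and each value estimate carries independent Gaussian noise with standard deviation proportional to that distance. The names `T` (distance to the goal), `k` (subgoal horizon), and `sigma` (noise scale) are illustrative choices.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def wrong_choice_prob(d_good, d_bad, sigma):
    """Probability that noisy values rank the farther option above the closer one.

    Assumes the value is -distance and the noise std is sigma * distance.
    """
    gap = d_bad - d_good  # true value gap between the two options
    noise_std = sigma * math.sqrt(d_good ** 2 + d_bad ** 2)
    return normal_cdf(-gap / noise_std)


T, k, sigma = 50, 5, 0.1
flat = wrong_choice_prob(T - 1, T + 1, sigma)  # compare one-step neighbours w.r.t. the far goal
high = wrong_choice_prob(T - k, T + k, sigma)  # compare subgoals k steps to either side
low = wrong_choice_prob(k - 1, k + 1, sigma)   # compare one-step neighbours w.r.t. the subgoal
hier = high + low                              # union bound on a hierarchical mistake
print(round(flat, 3), round(hier, 3))
```

Under these assumptions the flat policy compares two values whose gap is small relative to the noise, while each level of the hierarchy compares values whose gap is large relative to the noise it faces.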

4. Extensions: Flow Policies and Physics-Informed Regularization

Normalizing Flow-Based HIQL

NF–HIQL extends the original framework by parameterizing both high- and low-level policies using expressive normalizing flow models (e.g., RealNVP) instead of unimodal Gaussians (Garg et al., 11 Feb 2026). This allows for:

Multimodal action/subgoal distributions, essential for contact-rich and multi-path environments
Tractable likelihoods and analytic gradients
KL divergence-based control ensuring policies remain close to the behavior distribution
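To illustrate why flow models give tractable likelihoods, here is a minimal sketch of a single RealNVP-style affine coupling layer on a two-dimensional input. The conditioner `scale_and_shift` is a hypothetical fixed function standing in for a small neural network, and the code is not taken from NF–HIQL.

```python
import math


def scale_and_shift(x1):
    """Hypothetical conditioner: log-scale and shift computed from the untouched half."""
    return math.tanh(x1), 0.5 * x1


def coupling_forward(x1, x2):
    """Keep x1, affinely transform x2; also return the log|det Jacobian|."""
    s, t = scale_and_shift(x1)
    return x1, x2 * math.exp(s) + t, s


def coupling_inverse(y1, y2):
    """Exact inverse of coupling_forward."""
    s, t = scale_and_shift(y1)
    return y1, (y2 - t) * math.exp(-s)


def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def log_prob(y1, y2):
    """Exact log-density of y when a standard normal is pushed through the layer."""
    x1, x2 = coupling_inverse(y1, y2)
    s, _ = scale_and_shift(x1)
    return standard_normal_logpdf(x1) + standard_normal_logpdf(x2) - s


y1, y2, log_det = coupling_forward(0.3, -1.2)
x1, x2 = coupling_inverse(y1, y2)
print(x1, x2, log_prob(y1, y2))
```

Stacking several such layers, with the roles of the two halves alternating, yields multimodal densities while keeping both sampling and exact log-likelihood evaluation cheap.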

Explicit PAC-style sample efficiency bounds are derived, showing that NF–HIQL retains the sample efficiency and theoretical stability guarantees of the base algorithm.

Physics-Informed HIQL

Pi–HIQL introduces a geometric regularization on the value network, based on the Eikonal PDE, enforcing the constraint $\|\nabla_s V(s, g)\| = 1$, so that $V(s, g)$ behaves as a distance-to-goal field (Giammarino et al., 8 Sep 2025). This is motivated by optimal-control derivations (HJB equations) and is implemented as an additive penalty to the value loss: $\mathcal L = \mathcal L_V + \lambda\, \mathbb{E}_{(s, g)}\left[\big(\|\nabla_s V(s, g)\| - 1\big)^2\right]$, where $\mathcal L_V$ denotes the expectile value loss and $\lambda$ a penalty weight. Empirical ablations show this geometric bias is particularly beneficial in navigation and stitching regimes, where value smoothness and spatial structure are crucial.
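A small numerical sketch can show what an Eikonal-style penalty measures. It assumes the penalty is the mean squared deviation of the state-gradient norm of the value from one, and it approximates gradients by central finite differences instead of automatic differentiation; the toy value function and all names are hypothetical.

```python
import math


def distance_value(state, goal):
    """Toy value: negative Euclidean distance to the goal."""
    return -math.dist(state, goal)


def grad_norm(value_fn, state, goal, eps=1e-5):
    """Norm of the gradient of value_fn w.r.t. the state, by central differences."""
    grads = []
    for i in range(len(state)):
        up, down = list(state), list(state)
        up[i] += eps
        down[i] -= eps
        grads.append((value_fn(up, goal) - value_fn(down, goal)) / (2.0 * eps))
    return math.sqrt(sum(g * g for g in grads))


def eikonal_penalty(value_fn, states, goal):
    """Mean of (gradient norm - 1)**2 over the given states (assumed penalty form)."""
    return sum((grad_norm(value_fn, s, goal) - 1.0) ** 2 for s in states) / len(states)


states = [[0.0, 0.0], [1.0, 2.0], [-3.0, 0.5]]
goal = [4.0, 4.0]
good = eikonal_penalty(distance_value, states, goal)
bad = eikonal_penalty(lambda s, g: 2.0 * distance_value(s, g), states, goal)
print(good, bad)
```

A value that is exactly a distance field incurs essentially zero penalty, while a rescaled one (gradient norm two) is penalized, which is the geometric bias the regularizer is meant to impose.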

5. Inverse Q-Learning HIQL: Multi-Intention Behavior Segmentation

A separate but related HIQL variant addresses interpretable behavior modeling via inverse RL, as in "Multi-intention Inverse Q-learning" (Zhu et al., 2023). Here, the key idea is to segment demonstration trajectories into $K$ latent, piecewise-constant "intention" blocks, each with its own reward function $r_k$ and Q-function $Q_k$.

Segmentation: Each trajectory is modeled as a sequence of intentions, with change-points either explicit (step function) or governed by a latent Markov chain.
For each segment, standard inverse Q-learning is applied, assuming expert behavior is soft-optimal w.r.t. the reward in that segment.
An EM-style algorithm alternates between inferring latent segment boundaries (E-step, forward-backward/posterior smoothing) and solving per-segment IRL (M-step).
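The E-step for the latent-Markov-chain case can be sketched with a standard forward-backward smoother. This is a generic hidden-Markov-model routine, not the authors' implementation; the per-step likelihoods are assumed to come from each intention's soft-optimal policy, and the M-step (per-intention inverse Q-learning) is omitted.

```python
def forward_backward(init, trans, likelihoods):
    """Posterior P(z_t = i | whole trajectory) for a discrete latent intention chain.

    init[i]            prior probability of intention i at t = 0
    trans[i][j]        probability of switching from intention i to j
    likelihoods[t][i]  likelihood of the observed action at step t under intention i
    """
    T, K = len(likelihoods), len(init)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T):  # forward pass, normalised at every step for stability
        for j in range(K):
            prior = init[j] if t == 0 else sum(alpha[t - 1][i] * trans[i][j] for i in range(K))
            alpha[t][j] = prior * likelihoods[t][j]
        norm = sum(alpha[t])
        alpha[t] = [a / norm for a in alpha[t]]
    for t in range(T - 2, -1, -1):  # backward pass, also normalised
        for i in range(K):
            beta[t][i] = sum(trans[i][j] * likelihoods[t + 1][j] * beta[t + 1][j] for j in range(K))
        norm = sum(beta[t])
        beta[t] = [b / norm for b in beta[t]]
    posterior = []
    for t in range(T):  # combine both passes and renormalise
        unnorm = [alpha[t][i] * beta[t][i] for i in range(K)]
        z = sum(unnorm)
        posterior.append([u / z for u in unnorm])
    return posterior


# Hypothetical trajectory: actions fit intention 0 for three steps, then intention 1.
likelihoods = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3
posterior = forward_backward([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], likelihoods)
segments = [max(range(2), key=lambda i: p[i]) for p in posterior]
print(segments)
```

In a full EM loop, these posteriors would weight each transition's contribution when re-estimating the reward and Q-function of every intention.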

The result is a set of intention-specific reward maps and policies, which yield superior prediction and interpretability of complex, non-stationary animal/human behaviors compared to standard single-reward IRL (Zhu et al., 2023).

6. Hyperparameter Sensitivity and Empirical Behavior

Studies of hyperparameter sensitivity (Töpperwien et al., 5 Feb 2026) indicate that, in offline goal-conditioned RL, HIQL exhibits sharp, phase-dependent optima for hyperparameters such as learning rate and discount factor. The cause is traced to destructive gradient interference due to bootstrapped targets across relabeled goals; inter-goal gradient alignment diagnostics provide a quantifiable measure. In contrast, non-TD quasimetric learning objectives display broad, stable optima.

Empirical guidance for HIQL includes:

Careful tuning of the learning rate and discount factor $\gamma$ in each training phase and data mix
Preference for lower learning rates and moderate values of $\gamma$
Routine monitoring of inter-goal gradient alignment to preempt instability
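One simple way to monitor alignment between per-goal gradients is the average pairwise cosine similarity. This is an assumed diagnostic for illustration; the cited study may define its alignment measure differently, and the gradient vectors below are made up.

```python
import math


def cosine_alignment(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm = math.sqrt(sum(a * a for a in grad_a)) * math.sqrt(sum(b * b for b in grad_b))
    return dot / norm if norm > 0 else 0.0


def mean_pairwise_alignment(grads):
    """Average cosine similarity over all unordered pairs of per-goal gradients."""
    pairs = [(i, j) for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return sum(cosine_alignment(grads[i], grads[j]) for i, j in pairs) / len(pairs)


# Hypothetical value-loss gradients computed for three different relabeled goals.
per_goal_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
alignment = mean_pairwise_alignment(per_goal_grads)
print(alignment)
```

A persistently negative average indicates that updates for different goals tend to cancel each other, which is the kind of interference the guidance above suggests watching for.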

7. Experimental Evaluation

HIQL and its variants have been empirically evaluated across:

State-based domains: AntMaze-medium, -large, -ultra; Kitchen; CALVIN; with HIQL exceeding prior methods, for example on AntMaze-Large (Park et al., 2023).
Pixel-based settings: Procgen Maze, Visual AntMaze, Roboverse; HIQL reliably outperforms both flat and prior hierarchical approaches, scales to high-dimensional observations, and utilizes action-free data.
Physics-informed HIQL: Dramatic gains in pointmaze-giant-navigate, antmaze-giant-stitch under Pi-regularization, often more than doubling baseline performance (Giammarino et al., 8 Sep 2025).
Normalizing Flow HIQL: Robust generalization under data scarcity, with NF–HIQL maintaining competitive or superior performance in long-horizon and multimodal tasks when HIQL baselines degrade rapidly (Garg et al., 11 Feb 2026).
Inverse Q-Learning HIQL: Exact reward recovery in synthetic foraging; crisp intention segmentation and policy learning in mouse-labyrinth and reversal-learning datasets (Zhu et al., 2023).

Training with as little as 25% action-labeled transitions incurs negligible performance drop, demonstrating robust exploitation of action-free data (Park et al., 2023).

In summary, HIQL frameworks—both in forward RL and inverse RL—enable scalable, robust, and interpretable policy learning in settings characterized by limited data, long horizons, sparse rewards, and nonstationary behavioral regimes. This is achieved via a unified hierarchy over subgoals or intentions, action-free value learning, and, as developed in recent extensions, by integrating expressive policy parameterizations and domain physics (Park et al., 2023, Giammarino et al., 8 Sep 2025, Zhu et al., 2023, Garg et al., 11 Feb 2026, Töpperwien et al., 5 Feb 2026).