Hierarchical IQL-TD-MPC

Updated 18 March 2026
  • Hierarchical IQL-TD-MPC is a model-based RL algorithm that integrates implicit Q-learning with TD-MPC to address long-horizon, sparse-reward challenges.
  • It employs a two-level hierarchy where a Manager plans abstract actions and generates intent embeddings that guide an off-the-shelf Worker agent.
  • Empirical results on D4RL benchmarks demonstrate significant performance improvements over traditional flat offline RL methods.

Hierarchical IQL-TD-MPC is a model-based hierarchical reinforcement learning (RL) algorithm that extends Temporal Difference Learning for Model Predictive Control (TD-MPC) by integrating Implicit Q-Learning (IQL) in a temporally abstract manner. The approach addresses the challenges of long-horizon, sparse-reward tasks, particularly in offline RL, by employing a two-level hierarchy: a “Manager” based on IQL-TD-MPC, which plans using temporally extended abstract actions and intent embeddings, and a “Worker,” which can be any off-the-shelf offline RL agent leveraging the guidance provided by the Manager’s intent embeddings. This structure allows for efficient long-term planning and demonstrates significant empirical improvements on difficult navigation benchmarks (Chitnis et al., 2023).

1. Hierarchical Architecture and Role Separation

In hierarchical IQL-TD-MPC, the system is partitioned into a Manager and a Worker:

  • Manager (IQL-TD-MPC):
    • Operates at a temporal abstraction of $k$ environment steps per Manager step.
    • Learns a latent dynamics model, reward predictor, critic $Q^M$, value $V^M$, and a discrete policy $\pi^M$ using an offline model-based RL framework combining TD-MPC and IQL losses.
    • At evaluation, executes Model Predictive Control (MPC) in latent space for a planning horizon of $H$ abstract steps ($k \cdot H$ environment steps), generating a sequence of abstract actions $a^M_t, \ldots, a^M_{t+kH}$.
    • From the first abstract action $a^M_t$, derives an intent embedding $g_t$ approximating a subgoal $k$ steps ahead.
  • Worker (Off-the-Shelf Offline RL Agent):
    • Operates at the environment’s native time scale.
    • Receives input state augmented with the intent embedding: $[s_t; g_t]$.
    • Utilizes standard optimization routines and loss functions (e.g., AWAC, TD3-BC, DT, CQL) without algorithmic modifications, aside from input dimensionality.

The intent embedding $g_t$ is defined as $g_t = f^M_\theta(z^M_t, a^M_t) - z^M_t$, where $z^M_t = h^M_\theta(s_t)$ is the Manager’s latent state encoding and $f^M_\theta$ is the learned latent forward model. This formulation allows the Worker to resolve long-term ambiguities in offline data by leveraging the Manager’s temporal abstractions and subgoal representations.
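
For concreteness, the following minimal sketch shows how such an intent embedding could be computed; the network shapes, the `mlp` helper, and the flat-vector representation of the abstract action are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    # Small helper: a 2-hidden-layer MLP standing in for the Manager's networks.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ELU(),
        nn.Linear(hidden, hidden), nn.ELU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, latent_dim, action_dim = 29, 10, 80   # placeholder sizes (e.g. L*C = 8*10 logits)

h_M = mlp(state_dim, latent_dim)                 # encoder h^M: s -> z^M
f_M = mlp(latent_dim + action_dim, latent_dim)   # forward model f^M: (z^M, a^M) -> z^M'

def intent_embedding(s_t: torch.Tensor, a_M: torch.Tensor) -> torch.Tensor:
    """g_t = f^M(z^M_t, a^M_t) - z^M_t: the predicted latent displacement k steps ahead."""
    z_t = h_M(s_t)
    z_next = f_M(torch.cat([z_t, a_M], dim=-1))
    return z_next - z_t

g_t = intent_embedding(torch.randn(1, state_dim), torch.randn(1, action_dim))
print(g_t.shape)  # torch.Size([1, 10])
```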

2. Mathematical Formulation and Optimization

The algorithm integrates IQL and TD-MPC objectives in both state and latent spaces, structured as follows:

2.1 IQL Objective in State Space

  • Asymmetric Regression for Value Function (Expectile, $\tau \in (0.5, 1)$):

$$L_V = \mathbb{E}_{(s,a)\sim D}\left[ L_2^\tau\big(Q_{\text{target}}(s, a) - V(s)\big) \right], \qquad L_2^\tau(u) = |\tau - \mathbf{1}_{u < 0}| \cdot u^2$$

  • TD Loss for Critic:

$$L_Q = \mathbb{E}_{(s, a, r, s')\sim D}\left[\big( Q(s, a) - (r + \gamma V(s')) \big)^2\right]$$

  • Advantage-Weighted Policy Extraction Loss:

$$L_\pi = -\mathbb{E}_{(s, a)\sim D}\left[\exp\big(\beta\,(Q(s, a) - V(s))\big)\, \log \pi(a \mid s) \right]$$
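
As an illustration, the sketch below computes these three losses for a batch of transitions; the tensor inputs, default hyperparameters, and the weight clipping are assumptions made for the example rather than details taken from the paper.

```python
import torch

def expectile_loss(u: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    # L_2^tau(u) = |tau - 1[u < 0]| * u^2, averaged over the batch.
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()

def iql_losses(q_target, q, v, v_next, log_pi, r, gamma=0.99, tau=0.9, beta=3.0):
    """All arguments are 1-D tensors over a batch of (s, a, r, s') transitions:
    q_target = Q_target(s, a), q = Q(s, a), v = V(s), v_next = V(s'), log_pi = log pi(a|s)."""
    # Value loss: expectile regression of V(s) toward Q_target(s, a).
    L_V = expectile_loss(q_target - v, tau)
    # Critic loss: one-step TD regression toward r + gamma * V(s').
    L_Q = (q - (r + gamma * v_next)).pow(2).mean()
    # Policy loss: advantage-weighted regression (clipping the weights is a common
    # stabilization trick, assumed here rather than taken from the source).
    w = torch.exp(beta * (q_target - v)).clamp(max=100.0).detach()
    L_pi = -(w * log_pi).mean()
    return L_V, L_Q, L_pi
```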

2.2 TD-MPC Losses in Latent Space

  • Latent Consistency (Model) Loss:

$$L_f = \mathbb{E} \left[ \left\| f_\theta(z_t, a_t) - h_\theta(s_{t+1}) \right\|^2 \right]$$

  • Reward Prediction Loss:

$$L_R = \mathbb{E} \left[ \left| r_\theta(z_t, a_t) - r_{t+1} \right|^2 \right]$$

  • Latent-space Critic TD Loss:

$$L_Q^{TD} = \mathbb{E} \left[ \big( Q(z_t, a_t) - \left[\, r_{t+1} + \gamma\, Q(z_{t+1}, \pi(z_{t+1})) \,\right] \big)^2 \right]$$

  • Latent-space Policy Improvement Loss:

$$L_\pi^{TD} = -\mathbb{E} \left[ Q(z_t, \pi(z_t)) \right]$$
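
A compact sketch of these latent-space terms is given below; the callables `f`, `r_head`, `Q`, `pi`, and `h` are generic placeholders with assumed signatures, and the use of detached targets follows common TD-MPC practice rather than a detail stated above.

```python
import torch

def tdmpc_latent_losses(f, r_head, Q, pi, h, s_t, a_t, s_next, r_next, gamma=0.99):
    """f: latent forward model, r_head: reward predictor, Q: critic, pi: policy, h: encoder.
    s_t, a_t, s_next, r_next: one batch of transitions."""
    z_t = h(s_t)
    z_next_target = h(s_next).detach()                                # consistency target
    L_f = (f(z_t, a_t) - z_next_target).pow(2).sum(-1).mean()         # latent consistency
    L_R = (r_head(z_t, a_t) - r_next).pow(2).mean()                   # reward prediction
    td_target = (r_next + gamma * Q(z_next_target, pi(z_next_target))).detach()
    L_Q_td = (Q(z_t, a_t) - td_target).pow(2).mean()                  # latent critic TD
    L_pi_td = -Q(z_t.detach(), pi(z_t.detach())).mean()               # policy improvement
    return L_f, L_R, L_Q_td, L_pi_td
```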

2.3 Integrated Optimization

The full IQL-TD-MPC loss is expressed as:

$$L_{\text{total}} = c_f\,L_f + c_R\,L_R + c_Q\,L_Q^{TD} + \lambda_V\,L_V + \lambda_\pi\,L_\pi$$

with weighting coefficients $c_f = 2$, $c_R = 0.5$, $c_Q = 0.1$, a typical value $\lambda_V \approx 0.1$, and $\lambda_\pi$ set via the AWR weight $\beta$. The policy output can be either Gaussian (continuous) or categorical (discrete), reflecting the action space of the underlying task (Chitnis et al., 2023).
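
Numerically, the combination is just a weighted sum; a short sketch with the coefficients stated above (and $\lambda_\pi = 1$ used only as a placeholder):

```python
c_f, c_R, c_Q, lam_V, lam_pi = 2.0, 0.5, 0.1, 0.1, 1.0   # lam_pi shown as 1.0 for illustration

def total_loss(L_f, L_R, L_Q_td, L_V, L_pi):
    # L_total = c_f*L_f + c_R*L_R + c_Q*L_Q^TD + lambda_V*L_V + lambda_pi*L_pi
    return c_f * L_f + c_R * L_R + c_Q * L_Q_td + lam_V * L_V + lam_pi * L_pi
```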

3. Manager Pre-training on Temporally Abstracted Data

3.1 Temporal Abstraction and Abstract Transitions

The Manager is pretrained to model temporally abstract transitions:

  • Coarsening parameter $k$:
    • One Manager step corresponds to $k$ environment steps.
  • For a trajectory $s_0, a_0, \ldots, s_{kH}$, create transitions:

$$\left( s_{tk},\; s_{(t+1)k},\; r^M_{tk} \right), \qquad r^M_{tk} = \sum_{i=tk}^{(t+1)k - 1} r_i$$

  • Abstract action (inverse model):

$$a^M_{tk} = b^M_\theta\big(z^M_{tk},\, z^M_{(t+1)k}\big)$$
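
A minimal sketch of this coarsening step, assuming a trajectory stored as NumPy arrays (the array names, shapes, and dropping of leftover steps are illustrative choices):

```python
import numpy as np

def make_abstract_transitions(states: np.ndarray, rewards: np.ndarray, k: int = 8):
    """Coarsen a trajectory into Manager-level transitions.

    states:  (T+1, state_dim) array holding s_0 ... s_T
    rewards: (T,) array holding r_0 ... r_{T-1}
    Returns a list of (s_{tk}, s_{(t+1)k}, r^M_{tk}), with r^M the sum of the k rewards.
    """
    transitions = []
    T = len(rewards)
    for start in range(0, T - k + 1, k):
        r_abstract = rewards[start:start + k].sum()   # r^M_{tk} = sum_{i=tk}^{(t+1)k-1} r_i
        transitions.append((states[start], states[start + k], r_abstract))
    return transitions

# Example: a 32-step trajectory coarsened with k = 8 yields 4 abstract transitions.
abstract = make_abstract_transitions(np.random.randn(33, 29), np.random.randn(32), k=8)
assert len(abstract) == 4
```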

3.2 Model Architecture

  • Encoder $h^M$: multilayer perceptron mapping $s \in \mathbb{R}^n$ to $z^M \in \mathbb{R}^d$ ($d \approx 10$).
  • Inverse model $b^M$: maps $(z^M_t, z^M_{t+k})$ to the logits of $L$ discrete categorical variables, each with $C$ classes ($L \approx 8$, $C \approx 10$).
  • Forward model $f^M$, reward predictor $r^M_\theta$, critic $Q^M$, value $V^M$, and policy $\pi^M$, each parameterized as an MLP with 2–3 hidden layers.
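
To make the stated shapes concrete, the sketch below wires up an encoder and inverse model with those dimensions; the class name, hidden width, and activation are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ManagerNets(nn.Module):
    """Shape illustration: a d-dimensional latent encoder and an inverse model over L x C logits."""
    def __init__(self, state_dim=29, d=10, L=8, C=10, hidden=256):
        super().__init__()
        self.L, self.C = L, C
        self.encoder = nn.Sequential(                 # h^M: R^n -> R^d
            nn.Linear(state_dim, hidden), nn.ELU(), nn.Linear(hidden, d))
        self.inverse = nn.Sequential(                 # b^M: (z^M_t, z^M_{t+k}) -> L*C logits
            nn.Linear(2 * d, hidden), nn.ELU(), nn.Linear(hidden, L * C))

    def abstract_action_logits(self, z_t, z_tk):
        logits = self.inverse(torch.cat([z_t, z_tk], dim=-1))
        return logits.view(-1, self.L, self.C)        # one categorical per latent variable

nets = ManagerNets()
z_t = nets.encoder(torch.randn(1, 29))
z_tk = nets.encoder(torch.randn(1, 29))
print(nets.abstract_action_logits(z_t, z_tk).shape)   # torch.Size([1, 8, 10])
```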

3.3 Pre-training Regimen

  • Losses are identical to standard IQL-TD-MPC but applied to abstract transitions.
  • Optimization performed end-to-end via Adam (learning rate $3 \times 10^{-4}$, batch size 256, 300K steps).
  • After pretraining, all Manager parameters are frozen (Chitnis et al., 2023).
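
A schematic of this regimen, assuming a `manager` module that exposes a combined loss and a dataset of abstract transitions with a `sample` method (both placeholders):

```python
import torch

def pretrain_manager(manager, dataset, steps=300_000, batch_size=256, lr=3e-4):
    # End-to-end pre-training on abstract transitions with Adam, then freeze the Manager.
    opt = torch.optim.Adam(manager.parameters(), lr=lr)
    for _ in range(steps):
        batch = dataset.sample(batch_size)     # abstract (s, a^M, r^M, s') transitions
        loss = manager.total_loss(batch)       # combined IQL-TD-MPC objective (placeholder)
        opt.zero_grad()
        loss.backward()
        opt.step()
    for p in manager.parameters():             # frozen after pre-training
        p.requires_grad_(False)
    return manager
```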

4. Worker Integration with Intent Embeddings

At each environment step $t$:

  • Compute $z^M_t = h^M(s_t)$.
  • Compute $a^M_t = b^M(z^M_t, z^M_{t+k})$ (with $z^M_{t+k}$ obtained from $f^M$ during rollout).
  • Intent embedding: $g_t = f^M(z^M_t, a^M_t) - z^M_t$.
  • Worker policy input: $x_t = [s_t; g_t] \in \mathbb{R}^{n+d}$.

Worker agents retain their canonical loss functions and optimizer configurations; only the observation input shape is modified. No additional regularization or auxiliary objectives are introduced for the Worker. Worker algorithms used include AWAC, TD3-BC, DT, and CQL, with hyperparameters aligned to CORL defaults except for the augmented state dimension (Chitnis et al., 2023).
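
A sketch of the per-step augmentation is shown below; `manager.encoder` and `manager.forward_model` are hypothetical attribute names, and the Worker itself is left abstract since its losses and optimizer are unchanged.

```python
import torch

def augment_observation(manager, s_t: torch.Tensor, a_M: torch.Tensor) -> torch.Tensor:
    """Build the Worker input x_t = [s_t; g_t], with g_t = f^M(z^M_t, a^M_t) - z^M_t."""
    z_t = manager.encoder(s_t)
    g_t = manager.forward_model(z_t, a_M) - z_t     # d-dimensional intent embedding
    return torch.cat([s_t, g_t], dim=-1)            # shape: (batch, n + d)

# Any off-the-shelf offline RL Worker (AWAC, TD3-BC, DT, CQL, ...) then consumes x_t
# in place of s_t; only its first layer needs to be d units wider.
```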

5. Empirical Evaluation and Quantitative Results

5.1 Experimental Protocol

  • Environments:

D4RL AntMaze variants (umaze, medium, large, ultra, play/diverse splits), maze2d-medium-v1, halfcheetah-medium-v2.

  • Data:

Offline datasets with 200K–1M transitions, sparse reward structure.

  • Manager Hyperparameters:

$k=8$, $H=4$, $d=10$, $L=8$, $C=10$, $\tau=0.9$, $\beta = 3/\text{reward\_scale}$, 300K training steps.

  • Worker Hyperparameters:

As per CORL defaults; the only change is the state dimension, which is augmented by $d=10$.

5.2 Results Table

| Task | AWAC | BC | DT | IQL | TD3-BC | CQL |
|---|---|---|---|---|---|---|
| antmaze-medium-play | 0 → 36 | 0 → 52 | 0 → 43 | 70 → 64 | 0.2 → 60 | 0.8 → 33 |
| antmaze-medium-diverse | 0.8 → 16 | 0.2 → 20 | 0.2 → 33 | 63 → 30 | 0.4 → 21 | 0.2 → 14 |
| antmaze-large-play | 0 → 67 | 0 → 50 | 0 → 53 | 54 → 70 | 0 → 46 | 0 → 19 |
| antmaze-large-diverse | 0 → 40 | 0 → 38 | 0 → 31 | 31 → 46 | 0 → 29 | 0 → 16 |
| maze2d-medium-v1 | 43 → 67 | 3 → 70 | 13 → 71 | 32 → 78 | 101 → 47 | 104 → 16 |
| halfcheetah-medium-v2 | 49 → 45 | 42 → 45 | 42 → 47 | 47 → 43 | 47 → 44 | 46 → 44 |

Each cell reports the normalized score of the base offline RL agent → the same agent augmented with the Manager-derived intent embedding; in the original presentation, arrows marking statistically significant improvements ($p < 0.05$) are highlighted in green. In the AntMaze tasks, most baseline scores are near zero, and augmenting with Manager-derived intent yields normalized scores in the 30–70 range (Chitnis et al., 2023).

6. Ablations and Analytical Insights

6.1 Random-Vector Ablation

Replacing $g_t$ with i.i.d. random vectors removes the performance gains, indicating that Workers ignore non-informative intent inputs and that $g_t$ encodes relevant goal structure.

6.2 Architectural and Hyperparameter Sensitivity

  • Embedding dimension: best performance at $d=10$; lower or higher dimensions yield suboptimal results.
  • Abstract step size $k$: empirically, $k=8$ balances abstraction and fidelity.
  • Manager pretraining: performance plateaus beyond 200–300K steps.

6.3 Limitations

  • In fine-grained locomotion tasks (e.g., halfcheetah-medium-v2), appending $g_t$ may degrade performance, plausibly due to the lack of natural hierarchical structure or misleading intent information.
  • Manager’s MPC planning at inference is computationally non-trivial.
  • A fixed $k$ may limit adaptability; variable abstraction lengths represent a direction for future research (Chitnis et al., 2023).

7. Significance and Outlook

Hierarchical IQL-TD-MPC demonstrates that augmenting standard offline RL agents with structured, temporally abstract information from a pretrained Manager can robustly resolve long-horizon planning in complex sparse-reward domains. This paradigm achieves significant gains on navigation benchmarks where flat agents underperform, substantiating the efficacy of hierarchical abstraction and model-based planning in the offline RL regime. Future research may address limitations relating to task suitability, computational efficiency, and the flexibility of abstraction mechanisms (Chitnis et al., 2023).
