Hierarchical IQL-TD-MPC

Updated 18 March 2026
  • Hierarchical IQL-TD-MPC is a model-based RL algorithm that integrates implicit Q-learning with TD-MPC to address long-horizon, sparse-reward challenges.
  • It employs a two-level hierarchy where a Manager plans abstract actions and generates intent embeddings that guide an off-the-shelf Worker agent.
  • Empirical results on D4RL benchmarks demonstrate significant performance improvements over traditional flat offline RL methods.

Hierarchical IQL-TD-MPC is a model-based hierarchical reinforcement learning (RL) algorithm that extends Temporal Difference Learning for Model Predictive Control (TD-MPC) by integrating Implicit Q-Learning (IQL) in a temporally abstract manner. The approach addresses the challenges of long-horizon, sparse-reward tasks, particularly in offline RL, by employing a two-level hierarchy: a “Manager” based on IQL-TD-MPC, which plans using temporally extended abstract actions and intent embeddings, and a “Worker,” which can be any off-the-shelf offline RL agent leveraging the guidance provided by the Manager’s intent embeddings. This structure allows for efficient long-term planning and demonstrates significant empirical improvements on difficult navigation benchmarks (Chitnis et al., 2023).

1. Hierarchical Architecture and Role Separation

In hierarchical IQL-TD-MPC, the system is partitioned into a Manager and a Worker:

  • Manager (IQL-TD-MPC):
    • Operates at a temporal abstraction of $k$ environment steps per Manager step.
    • Learns a latent dynamics model, reward predictor, critic $Q^M$, value $V^M$, and a discrete policy $\pi^M$ using an offline model-based RL framework combining TD-MPC and IQL losses.
    • At evaluation, executes Model Predictive Control (MPC) in latent space for a planning horizon of $H$ abstract steps ($k \cdot H$ environment steps), generating a sequence of abstract actions $a^M_t, \ldots, a^M_{t+kH}$.
    • From the first abstract action $a^M_t$, derives an intent embedding $g_t$ approximating a subgoal $k$ steps ahead.
  • Worker (Off-the-Shelf Offline RL Agent):
    • Operates at the environment’s native time scale.
    • Receives input state augmented with the intent embedding: $[s_t; g_t]$.
    • Utilizes standard optimization routines and loss functions (e.g., AWAC, TD3-BC, DT, CQL) without algorithmic modifications, aside from input dimensionality.

The intent embedding $g_t$ is defined as $g_t = f^M_\theta(z^M_t, a^M_t) - z^M_t$, where $z^M_t = h^M_\theta(s_t)$ is the Manager’s latent state encoding and $f^M_\theta$ is the learned latent forward model. This formulation allows the Worker to resolve long-term ambiguities in offline data by leveraging the Manager’s temporal abstractions and subgoal representations.
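
For concreteness, the following minimal sketch shows how such an intent embedding could be computed; the network shapes, the `mlp` helper, and the flat-vector representation of the abstract action are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    # Small helper: a 2-hidden-layer MLP standing in for the Manager's networks.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ELU(),
        nn.Linear(hidden, hidden), nn.ELU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, latent_dim, action_dim = 29, 10, 80   # placeholder sizes (e.g. L*C = 8*10 logits)

h_M = mlp(state_dim, latent_dim)                 # encoder h^M: s -> z^M
f_M = mlp(latent_dim + action_dim, latent_dim)   # forward model f^M: (z^M, a^M) -> z^M'

def intent_embedding(s_t: torch.Tensor, a_M: torch.Tensor) -> torch.Tensor:
    """g_t = f^M(z^M_t, a^M_t) - z^M_t: the predicted latent displacement k steps ahead."""
    z_t = h_M(s_t)
    z_next = f_M(torch.cat([z_t, a_M], dim=-1))
    return z_next - z_t

g_t = intent_embedding(torch.randn(1, state_dim), torch.randn(1, action_dim))
print(g_t.shape)  # torch.Size([1, 10])
```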

2. Mathematical Formulation and Optimization

The algorithm integrates IQL and TD-MPC objectives in both state and latent spaces, structured as follows:

2.1 IQL Objective in State Space

  • Asymmetric Regression for Value Function (Expectile, $\tau \in (0.5, 1)$):

$$L_V = \mathbb{E}_{(s,a)\sim D}\left[ L_2^\tau\big(Q_{\text{target}}(s, a) - V(s)\big) \right], \qquad L_2^\tau(u) = |\tau - \mathbf{1}_{u < 0}| \cdot u^2$$

  • TD Loss for Critic:

$$L_Q = \mathbb{E}_{(s, a, r, s')\sim D}\left[\big( Q(s, a) - (r + \gamma V(s')) \big)^2\right]$$

  • Advantage-Weighted Policy Extraction Loss:

$$L_\pi = -\mathbb{E}_{(s, a)\sim D}\left[\exp\big(\beta\,(Q(s, a) - V(s))\big)\, \log \pi(a \mid s) \right]$$
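
As an illustration, the sketch below computes these three losses for a batch of transitions; the tensor inputs, default hyperparameters, and the weight clipping are assumptions made for the example rather than details taken from the paper.

```python
import torch

def expectile_loss(u: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    # L_2^tau(u) = |tau - 1[u < 0]| * u^2, averaged over the batch.
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()

def iql_losses(q_target, q, v, v_next, log_pi, r, gamma=0.99, tau=0.9, beta=3.0):
    """All arguments are 1-D tensors over a batch of (s, a, r, s') transitions:
    q_target = Q_target(s, a), q = Q(s, a), v = V(s), v_next = V(s'), log_pi = log pi(a|s)."""
    # Value loss: expectile regression of V(s) toward Q_target(s, a).
    L_V = expectile_loss(q_target - v, tau)
    # Critic loss: one-step TD regression toward r + gamma * V(s').
    L_Q = (q - (r + gamma * v_next)).pow(2).mean()
    # Policy loss: advantage-weighted regression (clipping the weights is a common
    # stabilization trick, assumed here rather than taken from the source).
    w = torch.exp(beta * (q_target - v)).clamp(max=100.0).detach()
    L_pi = -(w * log_pi).mean()
    return L_V, L_Q, L_pi
```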

2.2 TD-MPC Losses in Latent Space

  • Latent Consistency (Model) Loss:

$$L_f = \mathbb{E} \left[ \left\| f_\theta(z_t, a_t) - h_\theta(s_{t+1}) \right\|^2 \right]$$

  • Reward Prediction Loss:

$$L_R = \mathbb{E} \left[ \left| r_\theta(z_t, a_t) - r_{t+1} \right|^2 \right]$$

  • Latent-space Critic TD Loss:

$$L_Q^{TD} = \mathbb{E} \left[ \big( Q(z_t, a_t) - \left[\, r_{t+1} + \gamma\, Q(z_{t+1}, \pi(z_{t+1})) \,\right] \big)^2 \right]$$

  • Latent-space Policy Improvement Loss:

$$L_\pi^{TD} = -\mathbb{E} \left[ Q(z_t, \pi(z_t)) \right]$$
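
A compact sketch of these latent-space terms is given below; the callables `f`, `r_head`, `Q`, `pi`, and `h` are generic placeholders with assumed signatures, and the use of detached targets follows common TD-MPC practice rather than a detail stated above.

```python
import torch

def tdmpc_latent_losses(f, r_head, Q, pi, h, s_t, a_t, s_next, r_next, gamma=0.99):
    """f: latent forward model, r_head: reward predictor, Q: critic, pi: policy, h: encoder.
    s_t, a_t, s_next, r_next: one batch of transitions."""
    z_t = h(s_t)
    z_next_target = h(s_next).detach()                                # consistency target
    L_f = (f(z_t, a_t) - z_next_target).pow(2).sum(-1).mean()         # latent consistency
    L_R = (r_head(z_t, a_t) - r_next).pow(2).mean()                   # reward prediction
    td_target = (r_next + gamma * Q(z_next_target, pi(z_next_target))).detach()
    L_Q_td = (Q(z_t, a_t) - td_target).pow(2).mean()                  # latent critic TD
    L_pi_td = -Q(z_t.detach(), pi(z_t.detach())).mean()               # policy improvement
    return L_f, L_R, L_Q_td, L_pi_td
```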

2.3 Integrated Optimization

The full IQL-TD-MPC loss is expressed as:

$$L_{\text{total}} = c_f\,L_f + c_R\,L_R + c_Q\,L_Q^{TD} + \lambda_V\,L_V + \lambda_\pi\,L_\pi$$

with weighting coefficients $c_f = 2$, $c_R = 0.5$, $c_Q = 0.1$, a typical value $\lambda_V \approx 0.1$, and $\lambda_\pi$ set via the AWR weight $\beta$. The policy output can be either Gaussian (continuous) or categorical (discrete), reflecting the action space of the underlying task (Chitnis et al., 2023).
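
Numerically, the combination is just a weighted sum; a short sketch with the coefficients stated above (and $\lambda_\pi = 1$ used only as a placeholder):

```python
c_f, c_R, c_Q, lam_V, lam_pi = 2.0, 0.5, 0.1, 0.1, 1.0   # lam_pi shown as 1.0 for illustration

def total_loss(L_f, L_R, L_Q_td, L_V, L_pi):
    # L_total = c_f*L_f + c_R*L_R + c_Q*L_Q^TD + lambda_V*L_V + lambda_pi*L_pi
    return c_f * L_f + c_R * L_R + c_Q * L_Q_td + lam_V * L_V + lam_pi * L_pi
```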

3. Manager Pre-training on Temporally Abstracted Data

3.1 Temporal Abstraction and Abstract Transitions

The Manager is pretrained to model temporally abstract transitions:

  • Coarsening parameter $k$:
    • One Manager step corresponds to $k$ environment steps.
  • For a trajectory $s_0, a_0, \ldots, s_{kH}$, create transitions:

$$\left( s_{tk},\; s_{(t+1)k},\; r^M_{tk} \right), \qquad r^M_{tk} = \sum_{i=tk}^{(t+1)k - 1} r_i$$

  • Abstract action (inverse model):

$$a^M_{tk} = b^M_\theta\big(z^M_{tk},\, z^M_{(t+1)k}\big)$$
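
A minimal sketch of this coarsening step, assuming a trajectory stored as NumPy arrays (the array names, shapes, and dropping of leftover steps are illustrative choices):

```python
import numpy as np

def make_abstract_transitions(states: np.ndarray, rewards: np.ndarray, k: int = 8):
    """Coarsen a trajectory into Manager-level transitions.

    states:  (T+1, state_dim) array holding s_0 ... s_T
    rewards: (T,) array holding r_0 ... r_{T-1}
    Returns a list of (s_{tk}, s_{(t+1)k}, r^M_{tk}), with r^M the sum of the k rewards.
    """
    transitions = []
    T = len(rewards)
    for start in range(0, T - k + 1, k):
        r_abstract = rewards[start:start + k].sum()   # r^M_{tk} = sum_{i=tk}^{(t+1)k-1} r_i
        transitions.append((states[start], states[start + k], r_abstract))
    return transitions

# Example: a 32-step trajectory coarsened with k = 8 yields 4 abstract transitions.
abstract = make_abstract_transitions(np.random.randn(33, 29), np.random.randn(32), k=8)
assert len(abstract) == 4
```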

3.2 Model Architecture

  • Encoder $h^M$: multilayer perceptron mapping $s \in \mathbb{R}^n$ to $z^M \in \mathbb{R}^d$ ($d \approx 10$).
  • Inverse model $b^M$: maps $(z^M_t, z^M_{t+k})$ to the logits of $L$ discrete categorical variables, each with $C$ classes ($L \approx 8$, $C \approx 10$).
  • Forward model $f^M$, reward predictor $r^M_\theta$, critic $Q^M$, value $V^M$, and policy $\pi^M$, each parameterized as an MLP with 2–3 hidden layers.
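
To make the stated shapes concrete, the sketch below wires up an encoder and inverse model with those dimensions; the class name, hidden width, and activation are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ManagerNets(nn.Module):
    """Shape illustration: a d-dimensional latent encoder and an inverse model over L x C logits."""
    def __init__(self, state_dim=29, d=10, L=8, C=10, hidden=256):
        super().__init__()
        self.L, self.C = L, C
        self.encoder = nn.Sequential(                 # h^M: R^n -> R^d
            nn.Linear(state_dim, hidden), nn.ELU(), nn.Linear(hidden, d))
        self.inverse = nn.Sequential(                 # b^M: (z^M_t, z^M_{t+k}) -> L*C logits
            nn.Linear(2 * d, hidden), nn.ELU(), nn.Linear(hidden, L * C))

    def abstract_action_logits(self, z_t, z_tk):
        logits = self.inverse(torch.cat([z_t, z_tk], dim=-1))
        return logits.view(-1, self.L, self.C)        # one categorical per latent variable

nets = ManagerNets()
z_t = nets.encoder(torch.randn(1, 29))
z_tk = nets.encoder(torch.randn(1, 29))
print(nets.abstract_action_logits(z_t, z_tk).shape)   # torch.Size([1, 8, 10])
```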

3.3 Pre-training Regimen

  • Losses are identical to standard IQL-TD-MPC but applied to abstract transitions.
  • Optimization performed end-to-end via Adam (learning rate $3 \times 10^{-4}$, batch size 256, 300K steps).
  • After pretraining, all Manager parameters are frozen (Chitnis et al., 2023).
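
A schematic of this regimen, assuming a `manager` module that exposes a combined loss and a dataset of abstract transitions with a `sample` method (both placeholders):

```python
import torch

def pretrain_manager(manager, dataset, steps=300_000, batch_size=256, lr=3e-4):
    # End-to-end pre-training on abstract transitions with Adam, then freeze the Manager.
    opt = torch.optim.Adam(manager.parameters(), lr=lr)
    for _ in range(steps):
        batch = dataset.sample(batch_size)     # abstract (s, a^M, r^M, s') transitions
        loss = manager.total_loss(batch)       # combined IQL-TD-MPC objective (placeholder)
        opt.zero_grad()
        loss.backward()
        opt.step()
    for p in manager.parameters():             # frozen after pre-training
        p.requires_grad_(False)
    return manager
```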

4. Worker Integration with Intent Embeddings

At each environment step $t$:

  • Compute $z^M_t = h^M(s_t)$.
  • Compute $a^M_t = b^M(z^M_t, z^M_{t+k})$ (with $z^M_{t+k}$ obtained from $f^M$ during rollout).
  • Intent embedding: $g_t = f^M(z^M_t, a^M_t) - z^M_t$.
  • Worker policy input: $x_t = [s_t; g_t] \in \mathbb{R}^{n+d}$.

Worker agents retain their canonical loss functions and optimizer configurations; only the observation input shape is modified. No additional regularization or auxiliary objectives are introduced for the Worker. Worker algorithms used include AWAC, TD3-BC, DT, and CQL, with hyperparameters aligned to CORL defaults except for the augmented state dimension (Chitnis et al., 2023).
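
A sketch of the per-step augmentation is shown below; `manager.encoder` and `manager.forward_model` are hypothetical attribute names, and the Worker itself is left abstract since its losses and optimizer are unchanged.

```python
import torch

def augment_observation(manager, s_t: torch.Tensor, a_M: torch.Tensor) -> torch.Tensor:
    """Build the Worker input x_t = [s_t; g_t], with g_t = f^M(z^M_t, a^M_t) - z^M_t."""
    z_t = manager.encoder(s_t)
    g_t = manager.forward_model(z_t, a_M) - z_t     # d-dimensional intent embedding
    return torch.cat([s_t, g_t], dim=-1)            # shape: (batch, n + d)

# Any off-the-shelf offline RL Worker (AWAC, TD3-BC, DT, CQL, ...) then consumes x_t
# in place of s_t; only its first layer needs to be d units wider.
```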

5. Empirical Evaluation and Quantitative Results

5.1 Experimental Protocol

  • Environments:

D4RL AntMaze variants (umaze, medium, large, ultra, play/diverse splits), maze2d-medium-v1, halfcheetah-medium-v2.

  • Data:

Offline datasets with 200K–1M transitions, sparse reward structure.

  • Manager Hyperparameters:

$k=8$, $H=4$, $d=10$, $L=8$, $C=10$, $\tau=0.9$, $\beta = 3/\text{reward\_scale}$, 300K training steps.

  • Worker Hyperparameters:

As per CORL defaults; the only change is the state dimension, which is augmented by $d=10$.

5.2 Results Table

| Task | AWAC | BC | DT | IQL | TD3-BC | CQL |
|---|---|---|---|---|---|---|
| antmaze-medium-play | 0 → 36 | 0 → 52 | 0 → 43 | 70 → 64 | 0.2 → 60 | 0.8 → 33 |
| antmaze-medium-diverse | 0.8 → 16 | 0.2 → 20 | 0.2 → 33 | 63 → 30 | 0.4 → 21 | 0.2 → 14 |
| antmaze-large-play | 0 → 67 | 0 → 50 | 0 → 53 | 54 → 70 | 0 → 46 | 0 → 19 |
| antmaze-large-diverse | 0 → 40 | 0 → 38 | 0 → 31 | 31 → 46 | 0 → 29 | 0 → 16 |
| maze2d-medium-v1 | 43 → 67 | 3 → 70 | 13 → 71 | 32 → 78 | 101 → 47 | 104 → 16 |
| halfcheetah-medium-v2 | 49 → 45 | 42 → 45 | 42 → 47 | 47 → 43 | 47 → 44 | 46 → 44 |

Each cell reports the normalized score of the base offline RL agent → the same agent augmented with the Manager-derived intent embedding; in the original presentation, arrows marking statistically significant improvements ($p < 0.05$) are highlighted in green. In the AntMaze tasks, most baseline scores are near zero, and augmenting with Manager-derived intent yields normalized scores in the 30–70 range (Chitnis et al., 2023).

6. Ablations and Analytical Insights

6.1 Random-Vector Ablation

Replacing $g_t$ with i.i.d. random vectors removes the performance gains, indicating that Workers ignore non-informative intent inputs and that $g_t$ encodes relevant goal structure.

6.2 Architectural and Hyperparameter Sensitivity

  • Embedding dimension: best performance at $d=10$; lower or higher dimensions yield suboptimal results.
  • Abstract step size $k$: empirically, $k=8$ balances abstraction and fidelity.
  • Manager pretraining: performance plateaus beyond 200–300K steps.

6.3 Limitations

  • In fine-grained locomotion tasks (e.g., halfcheetah-medium-v2), appending $g_t$ may degrade performance, plausibly due to the lack of natural hierarchical structure or misleading intent information.
  • Manager’s MPC planning at inference is computationally non-trivial.
  • A fixed $k$ may limit adaptability; variable abstraction lengths represent a direction for future research (Chitnis et al., 2023).

7. Significance and Outlook

Hierarchical IQL-TD-MPC demonstrates that augmenting standard offline RL agents with structured, temporally abstract information from a pretrained Manager can robustly resolve long-horizon planning in complex sparse-reward domains. This paradigm achieves significant gains on navigation benchmarks where flat agents underperform, substantiating the efficacy of hierarchical abstraction and model-based planning in the offline RL regime. Future research may address limitations relating to task suitability, computational efficiency, and the flexibility of abstraction mechanisms (Chitnis et al., 2023).
