Hierarchical IQL-TD-MPC
- Hierarchical IQL-TD-MPC is a model-based RL algorithm that integrates implicit Q-learning with TD-MPC to address long-horizon, sparse-reward challenges.
- It employs a two-level hierarchy where a Manager plans abstract actions and generates intent embeddings that guide an off-the-shelf Worker agent.
- Empirical results on D4RL benchmarks demonstrate significant performance improvements over traditional flat offline RL methods.
Hierarchical IQL-TD-MPC is a model-based hierarchical reinforcement learning (RL) algorithm that extends Temporal Difference Learning for Model Predictive Control (TD-MPC) by integrating Implicit Q-Learning (IQL) in a temporally abstract manner. The approach addresses the challenges of long-horizon, sparse-reward tasks, particularly in offline RL, by employing a two-level hierarchy: a “Manager” based on IQL-TD-MPC, which plans using temporally extended abstract actions and intent embeddings, and a “Worker,” which can be any off-the-shelf offline RL agent leveraging the guidance provided by the Manager’s intent embeddings. This structure allows for efficient long-term planning and demonstrates significant empirical improvements on difficult navigation benchmarks (Chitnis et al., 2023).
1. Hierarchical Architecture and Role Separation
In hierarchical IQL-TD-MPC, the system is partitioned into a Manager and a Worker:
- Manager (IQL-TD-MPC):
- Operates at a temporal abstraction of $k$ environment steps per Manager step.
- Learns a latent dynamics model, reward predictor, critic, value function, and a discrete policy using an offline model-based RL framework that combines TD-MPC and IQL losses.
- At evaluation, executes Model Predictive Control (MPC) in latent space over a planning horizon of $H$ abstract steps ($Hk$ environment steps), generating a sequence of abstract actions $\bar{a}_t, \dots, \bar{a}_{t+H-1}$.
- From the first abstract action $\bar{a}_t$, derives an intent embedding $e_t$ approximating a subgoal $k$ steps ahead.
- Worker (Off-the-Shelf Offline RL Agent):
- Receives the Manager's intent embedding concatenated to its observation and is otherwise trained with its unmodified offline RL objective.
The intent embedding is defined as $e_t = d(z_t, \bar{a}_t)$, where $z_t = h(s_t)$ is the Manager's latent state encoding, $\bar{a}_t$ is the first abstract action of the Manager's plan, and $d$ is the learned latent forward model. This formulation allows the Worker to resolve long-term ambiguities in offline data by leveraging the Manager's temporal abstractions and subgoal representations.
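A minimal PyTorch sketch of this computation, assuming the Manager exposes its pretrained encoder $h$ and forward model $d$ as modules and that the forward model consumes the concatenated latent state and abstract action (the interface here is illustrative, not the authors' code):

```python
import torch
import torch.nn as nn


def compute_intent_embedding(encoder: nn.Module,
                             forward_model: nn.Module,
                             state: torch.Tensor,
                             abstract_action: torch.Tensor) -> torch.Tensor:
    """Compute e_t = d(h(s_t), a_bar_t): the Manager's predicted latent state
    k environment steps ahead, used as the Worker's intent input."""
    with torch.no_grad():  # the pretrained Manager is frozen at this point
        z_t = encoder(state)                                   # z_t = h(s_t)
        e_t = forward_model(torch.cat([z_t, abstract_action], dim=-1))
    return e_t
```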
2. Mathematical Formulation and Optimization
The algorithm integrates IQL and TD-MPC objectives in both state and latent spaces, structured as follows:
2.1 IQL Objective in State Space
- Asymmetric Regression for Value Function (Expectile $\tau$): $\mathcal{L}_V = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[L_2^\tau\big(\bar{Q}(s,a) - V(s)\big)\big]$, where $L_2^\tau(u) = |\tau - \mathbb{1}(u<0)|\,u^2$ and $\bar{Q}$ is a target critic.
- TD Loss for Critic: $\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\big[\big(r + \gamma V(s') - Q(s,a)\big)^2\big]$.
- Advantage-Weighted Regression (AWR) Policy Loss: $\mathcal{L}_\pi = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\exp\big(\beta\,(\bar{Q}(s,a) - V(s))\big)\,\log \pi(a \mid s)\big]$.
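The following PyTorch sketch illustrates these three objectives in their standard IQL form; the default values of $\tau$ and $\beta$ are illustrative placeholders rather than the paper's settings:

```python
import torch


def expectile_loss(q_value, value, tau=0.7):
    """Asymmetric L2 (expectile) regression: V is pushed toward an upper
    expectile of Q(s, a) over dataset actions."""
    diff = q_value - value
    weight = torch.abs(tau - (diff < 0).float())   # tau if diff > 0, else 1 - tau
    return (weight * diff.pow(2)).mean()


def critic_td_loss(q_value, reward, next_value, done, gamma=0.99):
    """TD regression of Q(s, a) toward r + gamma * V(s')."""
    target = reward + gamma * (1.0 - done) * next_value
    return (q_value - target.detach()).pow(2).mean()


def awr_policy_loss(log_prob, q_value, value, beta=3.0, max_weight=100.0):
    """Advantage-weighted regression: behavioural cloning weighted by
    exp(beta * (Q - V)), clipped for numerical stability."""
    advantage = (q_value - value).detach()
    weight = torch.exp(beta * advantage).clamp(max=max_weight)
    return -(weight * log_prob).mean()
```

Clipping the exponentiated advantage is a common implementation detail that keeps the AWR weights numerically stable.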
2.2 TD-MPC Losses in Latent Space
- Latent Consistency (Model) Loss: $\mathcal{L}_{\text{cons}} = \big\| d(z_t, a_t) - \bar{h}(s_{t+1}) \big\|_2^2$, where $z_t = h(s_t)$ and $\bar{h}$ is a slowly updated target encoder.
- Reward Prediction Loss: $\mathcal{L}_{\text{rew}} = \big( R(z_t, a_t) - r_t \big)^2$.
- Latent-Space Critic TD Loss: $\mathcal{L}_{Q}^{z} = \big( Q(z_t, a_t) - \big( r_t + \gamma\, \bar{Q}(z_{t+1}, \pi(z_{t+1})) \big) \big)^2$, with $\bar{Q}$ a target critic.
- Latent-Space Policy Improvement Loss: $\mathcal{L}_{\pi}^{z} = -\, Q(z_t, \pi(z_t))$, i.e., the policy is trained to maximize the learned latent critic.
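A compact sketch of the latent-space terms, assuming the caller has already encoded observations and evaluated the relevant target networks (tensor shapes and target-network handling are assumptions):

```python
import torch


def latent_consistency_loss(pred_next_z: torch.Tensor,
                            target_next_z: torch.Tensor) -> torch.Tensor:
    """Match d(z_t, a_t) against the target encoder's embedding of s_{t+1}."""
    return (pred_next_z - target_next_z.detach()).pow(2).mean()


def reward_prediction_loss(pred_reward, reward):
    """Regress the latent reward head R(z_t, a_t) onto the observed reward."""
    return (pred_reward - reward).pow(2).mean()


def latent_critic_td_loss(q, reward, next_q_target, gamma=0.99):
    """TD regression of Q(z_t, a_t) toward r_t + gamma * Q_bar(z_{t+1}, pi(z_{t+1}))."""
    return (q - (reward + gamma * next_q_target).detach()).pow(2).mean()


def latent_policy_loss(q_of_policy_action):
    """Policy improvement: maximize Q(z_t, pi(z_t)) by minimizing its negation."""
    return -q_of_policy_action.mean()
```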
2.3 Integrated Optimization
The full IQL-TD-MPC loss combines the TD-MPC model and reward losses with IQL-style value, critic, and policy objectives computed on the latent representations:
$$\mathcal{L} = c_{\text{cons}}\,\mathcal{L}_{\text{cons}} + c_{\text{rew}}\,\mathcal{L}_{\text{rew}} + c_{Q}\,\mathcal{L}_{Q} + c_{V}\,\mathcal{L}_{V} + c_{\pi}\,\mathcal{L}_{\pi},$$
where the weighting coefficients balance the model, reward, critic, value, and policy terms, and the expectile $\tau$ and AWR weight $\beta$ control how conservatively values and policies are fit to the offline data. The policy output can be either Gaussian (continuous) or categorical (discrete), reflecting the action space of the underlying task (Chitnis et al., 2023).
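In code, the integrated objective is simply a weighted sum of the individual terms defined above; the coefficients below are placeholders, not the values used by Chitnis et al. (2023):

```python
def iql_td_mpc_loss(consistency, reward_loss, critic_loss, value_loss, policy_loss,
                    coeffs=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of the five IQL-TD-MPC loss terms.
    The coefficients are illustrative defaults, not the paper's settings."""
    c_cons, c_rew, c_q, c_v, c_pi = coeffs
    return (c_cons * consistency + c_rew * reward_loss
            + c_q * critic_loss + c_v * value_loss + c_pi * policy_loss)
```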
3. Manager Pre-training on Temporally Abstracted Data
3.1 Temporal Abstraction and Abstract Transitions
The Manager is pretrained to model temporally abstract transitions:
- Coarsening parameter $k$: one Manager step corresponds to $k$ consecutive environment steps.
- For a trajectory $(s_0, a_0, r_0, s_1, \dots)$, create abstract transitions $(s_t, \bar{a}_t, \bar{r}_t, s_{t+k})$, where $\bar{r}_t$ accumulates the reward collected over the $k$ intervening steps.
- Abstract action (inverse model): $\bar{a}_t = I(z_t, z_{t+k})$, inferred from the pair of latent states that bracket the abstract step.
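A sketch of the coarsening step on a single trajectory; whether windows overlap and whether the abstract reward is discounted are details not fixed by this summary, so the non-overlapping, undiscounted variant below is only an illustrative assumption:

```python
import numpy as np


def make_abstract_transitions(states, rewards, k):
    """Coarsen one trajectory into Manager-level transitions spanning k steps.

    Each tuple is (s_t, r_bar_t, s_{t+k}); the discrete abstract action is
    recovered separately by the inverse model from the latent pair (z_t, z_{t+k}).
    """
    transitions = []
    for t in range(0, len(rewards) - k + 1, k):      # non-overlapping windows (an assumption)
        r_bar = float(np.sum(rewards[t:t + k]))      # undiscounted reward sum (an assumption)
        transitions.append((states[t], r_bar, states[t + k]))
    return transitions
```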
3.2 Model Architecture
- Encoder $h$: multilayer perceptron mapping the state $s_t$ to a latent representation $z_t$ (the latent dimension is a hyperparameter).
- Inverse model $I$: maps $(z_t, z_{t+k})$ to the logits of several independent discrete categorical variables, each with a fixed number of classes, which jointly define the abstract action.
- Forward model $d$, reward predictor $R$, critic $Q$, value function $V$, and policy $\pi$: each parameterized as an MLP with 2–3 hidden layers.
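An illustrative PyTorch sketch of the encoder and inverse-model heads; widths, depths, and the numbers of categorical variables and classes are unspecified hyperparameters here:

```python
import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int, hidden: int = 256, depth: int = 2) -> nn.Sequential:
    """Small MLP used for the Manager components (widths are illustrative)."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ELU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


class InverseModel(nn.Module):
    """Maps a pair of latent states (z_t, z_{t+k}) to logits over n_vars
    independent categorical variables with n_classes each; together these
    categoricals define the discrete abstract action."""

    def __init__(self, latent_dim: int, n_vars: int, n_classes: int):
        super().__init__()
        self.n_vars, self.n_classes = n_vars, n_classes
        self.net = mlp(2 * latent_dim, n_vars * n_classes)

    def forward(self, z_t: torch.Tensor, z_tk: torch.Tensor) -> torch.Tensor:
        logits = self.net(torch.cat([z_t, z_tk], dim=-1))
        return logits.view(*logits.shape[:-1], self.n_vars, self.n_classes)
```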
3.3 Pre-training Regimen
- Losses are identical to standard IQL-TD-MPC but applied to abstract transitions.
- Optimization is performed end-to-end with Adam (batch size 256, 300K gradient steps).
- After pretraining, all Manager parameters are frozen (Chitnis et al., 2023).
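A schematic pretraining loop under these settings; `dataset.sample` and `model.loss` are assumed interfaces, and the learning rate is a placeholder rather than the paper's value:

```python
import torch


def pretrain_manager(model, dataset, steps=300_000, batch_size=256, lr=3e-4):
    """End-to-end Manager pretraining on temporally abstract transitions."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # lr is a placeholder
    for _ in range(steps):
        batch = dataset.sample(batch_size)   # batch of abstract transitions (assumed interface)
        loss = model.loss(batch)             # combined IQL-TD-MPC loss (assumed interface)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    for p in model.parameters():             # freeze all Manager parameters afterwards
        p.requires_grad_(False)
    return model.eval()
```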
4. Worker Integration with Intent Embeddings
At each environment step $t$:
- Compute the latent state $z_t = h(s_t)$.
- Obtain the first abstract action $\bar{a}_t$ of the Manager's MPC plan (replanned from $z_t$ during the rollout).
- Intent embedding: $e_t = d(z_t, \bar{a}_t)$.
- Worker policy input: the concatenation $[s_t, e_t]$.
Worker agents retain their canonical loss functions and optimizer configurations; only the observation input shape is modified. No additional regularization or auxiliary objectives are introduced for the Worker. Worker algorithms include AWAC, BC, DT, IQL, TD3-BC, and CQL, with hyperparameters aligned to CORL defaults except for the augmented state dimension (Chitnis et al., 2023).
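At rollout time the integration reduces to a single concatenation, as in the sketch below (the `intent_embedding` and `act` methods are hypothetical interfaces, not a published API):

```python
import numpy as np


def worker_step(manager, worker_policy, obs):
    """One evaluation step: the frozen Manager produces an intent embedding,
    which is concatenated to the raw observation before the (otherwise
    unmodified) Worker selects an action."""
    e_t = manager.intent_embedding(obs)              # e_t = d(h(s_t), a_bar_t)
    obs_aug = np.concatenate([obs, e_t], axis=-1)    # augmented observation
    return worker_policy.act(obs_aug)
```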
5. Empirical Evaluation and Quantitative Results
5.1 Experimental Protocol
- Environments:
D4RL AntMaze variants (umaze, medium, large, ultra, play/diverse splits), maze2d-medium-v1, halfcheetah-medium-v2.
- Data:
Offline datasets with 200K–1M transitions, sparse reward structure.
- Manager Hyperparameters:
Coarsening parameter $k$, latent dimension, expectile $\tau$, AWR weight $\beta$, and learning rate as specified in the original paper; pretraining for 300K steps.
- Worker Hyperparameters:
As per CORL defaults. The only change is the state dimension, which is augmented by the dimension of the intent embedding.
5.2 Results (normalized scores: flat Worker → intent-augmented Worker)
| Task | AWAC | BC | DT | IQL | TD3-BC | CQL |
|---|---|---|---|---|---|---|
| antmaze-medium-play | 0 → 36 | 0 → 52 | 0 → 43 | 70 → 64 | 0.2 → 60 | 0.8 → 33 |
| antmaze-medium-diverse | 0.8 → 16 | 0.2 → 20 | 0.2 → 33 | 63 → 30 | 0.4 → 21 | 0.2 → 14 |
| antmaze-large-play | 0 → 67 | 0 → 50 | 0 → 53 | 54 → 70 | 0 → 46 | 0 → 19 |
| antmaze-large-diverse | 0 → 40 | 0 → 38 | 0 → 31 | 31 → 46 | 0 → 29 | 0 → 16 |
| maze2d-medium-v1 | 43 → 67 | 3 → 70 | 13 → 71 | 32 → 78 | 101 → 47 | 104 → 16 |
| halfcheetah-medium-v2 | 49 → 45 | 42 → 45 | 42 → 47 | 47 → 43 | 47 → 44 | 46 → 44 |
Each entry shows the normalized score of the flat Worker → the intent-augmented Worker; statistically significant improvements are highlighted in the original results. In AntMaze tasks, most flat offline RL baselines score near zero, while augmenting with Manager-derived intent yields normalized scores in the 30–70 range (Chitnis et al., 2023).
6. Ablations and Analytical Insights
6.1 Random-Vector Ablation
Replacing $e_t$ with i.i.d. random vectors removes the performance gains, indicating that Workers learn to ignore non-informative intent inputs and that $e_t$ encodes relevant goal structure.
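The ablation amounts to swapping the intent input for noise of the same shape, as in this small sketch (dimension and seed are arbitrary):

```python
import numpy as np

_rng = np.random.default_rng(0)


def random_intent(dim: int) -> np.ndarray:
    """Ablation control: an i.i.d. random vector in place of e_t. If Workers
    trained with this input match the flat baseline, any gains from the real
    e_t must come from its content rather than the extra input dimensions."""
    return _rng.standard_normal(dim).astype(np.float32)
```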
6.2 Architectural and Hyperparameter Sensitivity
- Embedding dimension: best performance at an intermediate intent-embedding dimension; substantially smaller or larger dimensions yield suboptimal results.
- Abstract step size $k$: the chosen value empirically balances temporal abstraction against model fidelity.
- Manager pretraining: performance plateaus beyond 200–300K steps.
6.3 Limitations
- In fine-grained locomotion tasks (e.g., halfcheetah-medium-v2), appending $e_t$ may degrade performance, plausibly due to the lack of natural hierarchical structure or misleading intent information.
- Manager’s MPC planning at inference is computationally non-trivial.
- A fixed $k$ may limit adaptability; variable abstraction lengths represent a direction for future research (Chitnis et al., 2023).
7. Significance and Outlook
Hierarchical IQL-TD-MPC demonstrates that augmenting standard offline RL agents with structured, temporally abstract information from a pretrained Manager can robustly improve long-horizon planning in complex sparse-reward domains. This paradigm achieves significant gains on navigation benchmarks where flat agents underperform, substantiating the efficacy of hierarchical abstraction and model-based planning in the offline RL regime. Future research may address limitations relating to task suitability, computational efficiency, and the flexibility of abstraction mechanisms (Chitnis et al., 2023).