Double-Horizon MBPO: Adaptive Hyperparameter Scheduling
- DHMBPO is a framework that adaptively tunes key hyperparameters, notably the real data ratio (β), to balance estimation error and model bias in reinforcement learning.
- It integrates a meta-controller via the AutoMBPO framework, treating hyperparameter scheduling as a sequential decision process and optimizing dynamic adjustments based on the current training state.
- Empirical evaluations show that DHMBPO improves sample efficiency and performance over fixed settings, with validated gains across continuous-control benchmarks.
Double-Horizon MBPO (DHMBPO) refers to methodologies for the principled scheduling of the real data ratio and other hyperparameters in Model-Based Policy Optimization (MBPO) within model-based reinforcement learning. Central to recent advances is the AutoMBPO framework, which casts hyperparameter scheduling as a sequential decision process and introduces theoretically supported, adaptive strategies for the real-vs-synthetic data mixture during policy improvement. DHMBPO leverages a meta-controller that dynamically tunes critical parameters—including the real data ratio (β)—to optimize the balance between estimation error and model-bias error over the training lifecycle (Lai et al., 2021).
1. Theoretical Foundations
The primary analytical contribution is a finite-sample error bound for Dyna-style fitted value iteration (FVI), where next-state samples are drawn from either the real environment (with probability β) or a learned model (with probability 1–β). The main theorem (Theorem 3.2) provides a high-probability upper bound of the schematic form

$$\big\|Q^{*}-Q^{\pi_{K}}\big\|_{\infty}\;\le\;\varepsilon_{\mathrm{approx}}\;+\;\varepsilon_{\mathrm{est}}(\beta,N_{\mathrm{real}})\;+\;\varepsilon_{\mathrm{bias}}(\beta,\sigma)\;+\;\varepsilon_{\mathrm{trunc}}(K),$$

where $\varepsilon_{\mathrm{approx}}$ is the function-class approximation error, $N_{\mathrm{real}}$ the number of real transitions, $\sigma$ the model error, and $K$ the number of FVI iterations (see the paper for explicit constants and rates).
The non-constant terms are interpreted as follows:
- Estimation error (second term) grows with β and shrinks with N₍real₎; it is largest when β is high and real experience is scarce.
- Model-bias error (third term) depends on the model error σ and shrinks as β increases, vanishing as β → 1.
- Truncation error (fourth term) diminishes exponentially with the number of FVI iterations K.
Thus, optimal scheduling requires β to increase monotonically with N₍real₎, balancing diminishing estimation and model-bias errors as real experience accrues. A plausible implication is that static choices of β are sub-optimal across the training horizon (Lai et al., 2021).
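The monotone-β implication can be illustrated numerically. The sketch below minimizes a toy version of the bound over β as real data accumulates; the functional forms and constants (quadratic-in-β estimation term, linear model-bias term, coefficients `c_est`, `c_bias`, `sigma`) are illustrative assumptions, not the paper's actual rates.

```python
# Toy version of the Theorem 3.2 tradeoff: pick the beta that minimises
# estimation + model-bias + truncation error for a given amount of real data.
# All functional forms and constants here are illustrative assumptions.

GAMMA, K = 0.9, 50  # discount and FVI iterations (truncation term only)

def toy_bound(beta, n_real, sigma=0.5, c_est=50.0, c_bias=1.0):
    estimation = c_est * beta ** 2 / n_real ** 0.5  # grows with beta, shrinks with data
    model_bias = c_bias * (1.0 - beta) * sigma      # vanishes as beta -> 1
    truncation = 2.0 * GAMMA ** K / (1.0 - GAMMA)   # independent of beta
    return estimation + model_bias + truncation

def best_beta(n_real):
    grid = [i / 100 for i in range(1, 101)]         # beta in {0.01, ..., 1.0}
    return min(grid, key=lambda b: toy_bound(b, n_real))

# The bound-minimising beta rises monotonically as real experience accrues.
schedule = [(n, best_beta(n)) for n in (400, 10_000, 250_000, 1_000_000)]
```

Under these toy constants, the minimizer climbs from a small β at a few hundred real steps to β = 1 once real data is plentiful, mirroring the monotone schedules described below.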
2. Hyperparameter Scheduling via AutoMBPO
AutoMBPO operationalizes DHMBPO through a meta-controller that treats hyperparameter selection as a Markov Decision Process (hyper-MDP):
- State: Six features, including N₍real₎ (real transitions collected), the model validation loss, the critic loss, the policy shift, the recent return, and the current hyperparameter values {β, G, k}.
- Action: Discrete choices to multiplicatively adjust β ({1/c, 1, c}), increment/decrement G (SAC gradient steps), increment/decrement k (model rollout length), and a model-update flag.
- Reward: Average environment return, observed every H real-environment steps.
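As a concrete sketch of the hyper-MDP action space, the snippet below applies one discrete hyper-action to the current hyperparameters. The multiplier `C`, the clipping bounds, and the default values are assumptions for illustration, not the paper's exact implementation.

```python
from dataclasses import dataclass

C = 1.5  # assumed multiplicative step for beta

@dataclass
class HyperParams:
    beta: float = 0.05       # real data ratio
    G: int = 20              # SAC gradient steps
    k: int = 1               # model rollout length
    update_model: bool = False

def apply_hyper_action(hp, beta_op, g_op, k_op, model_flag):
    """Each *_op is in {-1, 0, +1}: decrease, keep, or increase."""
    beta = min(max(hp.beta * C ** beta_op, 0.01), 1.0)  # multiply by 1/C, 1, or C
    return HyperParams(beta=beta,
                       G=max(1, hp.G + g_op),
                       k=max(1, hp.k + k_op),
                       update_model=model_flag)
```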
The meta-controller is trained (e.g., with PPO) to maximize expected return by episodically resetting MBPO and using the collected meta-experience tuples (state, action, reward) for policy improvement.
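A minimal sketch of one meta-training episode, assuming stand-in callables `mbpo_reset` and `mbpo_step` (not the paper's API): the controller picks a hyper-action, MBPO runs for H real steps, and the resulting (state, action, reward) tuples become meta-experience for the controller's policy-gradient update.

```python
def run_hyper_episode(mbpo_reset, mbpo_step, controller,
                      horizon_h=1000, total_real_steps=10_000):
    """Collect meta-experience from one (reset) MBPO run.

    mbpo_reset() -> initial training state
    mbpo_step(action, horizon_h) -> (next_state, avg_return_over_horizon)
    """
    meta_experience = []
    state = mbpo_reset()
    for _ in range(total_real_steps // horizon_h):
        action = controller(state)                    # choose hyper-adjustments
        next_state, reward = mbpo_step(action, horizon_h)
        meta_experience.append((state, action, reward))
        state = next_state
    return meta_experience                            # consumed by PPO updates
```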
3. Practical Scheduling Policies
Empirically discovered β-schedules exhibit monotonic growth aligned with theoretical predictions. Representative examples:
| Task | Initial β | Final β | Real Steps to Final β |
|---|---|---|---|
| Hopper (Mujoco) | 0.05 | 1.0 | 100K |
| Ant (Mujoco) | 0.05 | 0.8 | 150K |
| Humanoid (Mujoco) | 0.05 | 1.0 | 300K |
| PyBullet tasks | 0.05 | 0.7–0.8 | Gradual (<300K) |
β increases roughly linearly with N₍real₎, validating the bound-driven intuition that the optimal real data ratio rises with training progress. This scheduling principle emerged automatically: the meta-controller consistently increased β as real data accumulated (Lai et al., 2021).
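The discovered schedules can be approximated by a simple linear ramp. The helper below reproduces, e.g., the Hopper row of the table above (0.05 → 1.0 over 100K real steps); it is a hand-written approximation for illustration, not the controller's learned policy.

```python
def beta_schedule(real_steps, beta0=0.05, beta_final=1.0, ramp_steps=100_000):
    """Linearly interpolate beta from beta0 to beta_final over ramp_steps."""
    if real_steps >= ramp_steps:
        return beta_final                  # hold the final ratio thereafter
    return beta0 + (real_steps / ramp_steps) * (beta_final - beta0)
```

For Ant, `beta_final=0.8` and `ramp_steps=150_000` would match the corresponding row of the table.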
4. Algorithmic Integration
Integration into standard MBPO implementations involves:
- Two replay buffers: a real buffer (environment transitions) and a synthetic buffer (model-generated transitions).
- Every τ real steps, the meta-controller schedules β, G, k, and the model-update flag.
- When flagged, the dynamics model ensemble is retrained.
- For data generation, k-step rollouts are sampled with each ensemble model, starting from states drawn from the real buffer; the resulting transitions populate the synthetic buffer.
- Each policy optimization batch draws β·batch_size transitions from the real buffer and (1–β)·batch_size from the synthetic buffer, conducting G gradient steps per block.
This design ensures adaptive adjustment of both policy and model learning, with parameter choices responsive to the current training state.
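The mixed-batch sampling step can be sketched as follows, assuming plain Python lists as buffers (a real implementation would use array-based replay buffers):

```python
import random

def sample_mixed_batch(real_buffer, synth_buffer, beta, batch_size):
    """Draw ~beta*batch_size real and the rest synthetic transitions."""
    n_real = min(round(beta * batch_size), len(real_buffer), batch_size)
    n_synth = batch_size - n_real
    batch = (random.sample(real_buffer, n_real)
             + random.sample(synth_buffer, n_synth))
    random.shuffle(batch)                  # interleave real and synthetic samples
    return batch
```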
5. Empirical Results and Validation
Experimental results on six continuous-control benchmarks (Hopper, Ant, Humanoid; HopperBullet, Walker2dBullet, HalfCheetahBullet) demonstrate:
- Sample efficiency: AutoMBPO matches the asymptotic performance of the MBPO baseline using only 50–70% of the real samples.
- Performance gains: Terminal average returns are 10–30% higher than MBPO with fixed β=0.05.
- Comparison to model-free counterparts: AutoMBPO retains sample efficiency and avoids severe overfitting found in model-free SAC(20) settings.
- Ablation studies: Scheduling β alone (AutoMBPO-R) recovers the majority of the performance gains of full AutoMBPO; omitting hyper-state features such as the model loss degrades performance, supporting the theoretical structure.
- Controller transfer: A meta-controller trained on one suite of PyBullet tasks generalizes to others with minimal loss.
- Statistical significance: Paired t-tests on final performance indicate significant improvements on all tasks, confirming that the gains are robust (Lai et al., 2021).
6. Significance and Implications
The error decomposition and scheduling mechanism of DHMBPO represent a principled approach to managing the bias–variance tradeoff central to model-based RL. Gradually increasing the real data ratio β as experience accumulates is both theoretically optimal and empirically validated. AutoMBPO’s meta-learning formulation enables automatic, context-sensitive discovery of hyperparameter schedules, suggesting a general methodology for robust model-based policy optimization. The predominance of β as the critical hyperparameter is confirmed by extensive ablation, indicating that high-leverage improvements can be achieved without dense manual tuning of auxiliary parameters.
A plausible implication is that similar meta-scheduling frameworks could optimize other algorithmic schedules where error tradeoffs are dynamic and data-dependent. The verified generalization of the meta-controller across related domains further indicates potential for transferable hyperparameter scheduling strategies in continual learning settings (Lai et al., 2021).