
Double-Horizon MBPO: Adaptive Hyperparameter Scheduling

Updated 23 March 2026
  • DHMBPO is a framework that adaptively tunes key hyperparameters, notably the real data ratio (β), to balance estimation error and model bias in reinforcement learning.
  • It integrates a meta-controller via the AutoMBPO framework, treating hyperparameter scheduling as a sequential decision process and optimizing dynamic adjustments based on the training state.
  • Empirical evaluations show that DHMBPO improves sample efficiency and performance over fixed settings, with validated gains across continuous-control benchmarks.

Double-Horizon MBPO (DHMBPO) refers to methodologies for the principled scheduling of the real data ratio and other hyperparameters in Model-Based Policy Optimization (MBPO) within model-based reinforcement learning. Central to recent advances is the AutoMBPO framework, which casts hyperparameter scheduling as a sequential decision process and introduces theoretically supported, adaptive strategies for the real-vs-synthetic data mixture during policy improvement. DHMBPO leverages a meta-controller that dynamically tunes critical parameters—including the real data ratio (β)—to optimize the balance between estimation error and model-bias error over the training lifecycle (Lai et al., 2021).

1. Theoretical Foundations

The primary analytical contribution is a finite-sample error bound for Dyna-style fitted value iteration (FVI), where next-state samples are drawn from either the real environment (with probability β) or a learned model (with probability 1–β). The main theorem (Theorem 3.2) provides a high-probability upper bound:

$$
\Vert V^* - V^{\pi_K} \Vert_{p,\rho} \;\leq\; \frac{2\gamma}{(1-\gamma)^2}\, C_{\rho,\mu}^{1/p}\, d_{p,\mu}(B\mathcal{F},\mathcal{F}) \;+\; O\!\left(\left(\frac{\beta |A|}{N_{\text{real}} \big(\log(N_{\text{real}}/(\beta|A|)) + \log(K/\delta)\big)}\right)^{1/(2p)}\right)
$$

$$
\qquad +\; O\!\left(\Phi^{-1}\!\left(1-\frac{\beta \delta}{8K N_{\text{real}}(1-\beta)}\right)\sigma\right) \;+\; O\!\left(\gamma^{K/p} V_{\max}\right)
$$

The non-constant terms can be interpreted as follows:

  • Estimation error (second term) scales with √(β/N₍real₎), increasing when β is high and N₍real₎ is low.
  • Model-bias error (third term) depends on model error σ and shrinks as β increases, vanishing at β → 1.
  • Truncation error (fourth term) diminishes exponentially with K.

Thus, optimal scheduling requires β to increase monotonically with N₍real₎, balancing diminishing estimation and model-bias errors as real experience accrues. A plausible implication is that static choices of β are sub-optimal across the training horizon (Lai et al., 2021).
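
A rough numerical illustration of these directions is sketched below. The functional forms and constants (including σ = 0.02) are assumptions for demonstration only, not values from the paper; the point is that the estimation proxy grows with β and shrinks with N₍real₎, while the model-bias proxy vanishes as β → 1, so with these illustrative constants the smallest combined proxy moves from β = 0.05 at N₍real₎ = 1K to β = 1.0 at N₍real₎ = 100K.

```python
import numpy as np

# Illustrative proxies for the two beta-dependent error terms in the bound.
# Constants (and sigma = 0.02) are assumptions for demonstration, not the paper's values.
def estimation_term(beta, n_real):
    # Grows with beta and shrinks as more real transitions are collected.
    return np.sqrt(beta / n_real)

def model_bias_term(beta, sigma=0.02):
    # Driven by the model error sigma; vanishes as beta -> 1.
    return sigma * (1.0 - beta)

for n_real in (1_000, 100_000):
    for beta in (0.05, 0.5, 1.0):
        est, bias = estimation_term(beta, n_real), model_bias_term(beta)
        print(f"N_real={n_real:>6d}  beta={beta:.2f}  "
              f"estimation≈{est:.4f}  model_bias≈{bias:.4f}  total≈{est + bias:.4f}")
```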

2. Hyperparameter Scheduling via AutoMBPO

AutoMBPO operationalizes DHMBPO through a meta-controller that treats hyperparameter selection as a Markov Decision Process (hyper-MDP):

  • State: Six features, including N₍real₎ (the number of real transitions collected), the model validation loss ℒ_T̂, the critic loss ℒ_Q, the policy shift ε_π, the recent return, and the current hyperparameter values {β, G, k}.
  • Action: Discrete choices to multiplicatively adjust β ({1/c, 1, c}), increment/decrement G (SAC gradient steps), increment/decrement k (model rollout length), and a model-update flag.
  • Reward: Average environment return, observed every H real-environment steps.

The meta-controller is trained (e.g., with PPO) to maximize expected return, episodically resetting MBPO and using the collected meta-experience tuples (state, action, reward) for policy improvement.
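
The hyper-MDP can be sketched as a thin, gym-style wrapper around an MBPO trainer, where one meta-step runs MBPO for H real-environment steps under the chosen hyperparameters. The sketch below is illustrative only: the `MBPOTrainer` hooks, attribute names, and the multiplicative factor `C` are assumptions standing in for whatever the underlying implementation exposes, and a meta-controller such as PPO would interact with this wrapper like with any other environment.

```python
import numpy as np

C = 1.5  # illustrative multiplicative factor for adjusting beta (an assumption, not the paper's value)

class HyperMDP:
    """Gym-style sketch of the hyper-MDP: one step = H real-environment steps of MBPO
    run under the hyperparameters chosen by the meta-controller."""

    def __init__(self, trainer, horizon_h=1000):
        self.trainer = trainer            # assumed MBPO trainer exposing the hooks used below
        self.horizon_h = horizon_h
        self.beta, self.grad_steps, self.rollout_len = 0.05, 20, 1

    def _state(self):
        # Hyper-state features described above (attribute names are illustrative).
        t = self.trainer
        return np.array([t.n_real, t.model_val_loss, t.critic_loss, t.policy_shift,
                         t.recent_return, self.beta, self.grad_steps, self.rollout_len],
                        dtype=np.float32)

    def step(self, action):
        beta_choice, d_grad, d_rollout, update_model = action    # discrete controller outputs
        self.beta = float(np.clip(self.beta * {0: 1 / C, 1: 1.0, 2: C}[beta_choice], 0.05, 1.0))
        self.grad_steps = max(1, self.grad_steps + d_grad)        # increment/decrement G
        self.rollout_len = max(1, self.rollout_len + d_rollout)   # increment/decrement k
        if update_model:
            self.trainer.fit_dynamics_model()                     # model-update flag
        # Run MBPO for H real steps under the scheduled hyperparameters; reward = average return.
        avg_return = self.trainer.run(self.horizon_h, self.beta, self.grad_steps, self.rollout_len)
        return self._state(), avg_return, self.trainer.done, {}
```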

3. Practical Scheduling Policies

Empirically discovered β-schedules exhibit monotonic growth aligned with theoretical predictions. Representative examples:

| Task | Initial β | Final β | Real Steps to Final β |
|---|---|---|---|
| Hopper (MuJoCo) | 0.05 | 1.0 | 100K |
| Ant (MuJoCo) | 0.05 | 0.8 | 150K |
| Humanoid (MuJoCo) | 0.05 | 1.0 | 300K |
| PyBullet tasks | 0.05 | 0.7–0.8 | Gradual (<300K) |

β increases roughly linearly with N₍real₎, validating the bound-driven intuition that the optimal real data ratio rises with training progress. This schedule emerged automatically: the meta-controller consistently increased β as real data accumulated (Lai et al., 2021).
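
When a learned controller is unavailable, the table above suggests a simple fixed approximation: ramp β linearly from its initial to its final value over the listed number of real steps. The helper below is a hedged sketch of that idea, not part of AutoMBPO itself; the default arguments mirror the Hopper row (0.05 → 1.0 over 100K real steps).

```python
def linear_beta_schedule(n_real, beta_init=0.05, beta_final=1.0, ramp_steps=100_000):
    """Fixed linear ramp approximating the learned schedules above
    (Hopper-style defaults: 0.05 -> 1.0 over 100K real steps)."""
    frac = min(n_real / ramp_steps, 1.0)
    return beta_init + frac * (beta_final - beta_init)

# e.g., beta after 25K, 50K, and 150K real transitions on a Hopper-like task
print([round(linear_beta_schedule(n), 2) for n in (25_000, 50_000, 150_000)])
```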

4. Algorithmic Integration

Integration into standard MBPO implementations involves:

  • Two replay buffers: D_env (real) and D_model (synthetic).
  • Every τ real steps, the meta-controller schedules β, G, k, and the model-update flag.
  • When flagged, the dynamics model ensemble is retrained.
  • For data generation, k-step rollouts are generated by each ensemble model from starting states sampled from D_env, populating D_model.
  • Each policy-optimization batch draws β·batch_size transitions from D_env and (1–β)·batch_size from D_model, performing G gradient steps per update block.

This design ensures adaptive adjustment of both policy and model learning, with parameter choices responsive to the current training state.
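
A condensed sketch of one such update block is below. The buffer, ensemble, and agent interfaces (`sample`, `add`, `rollout`, `update`, and list-style batch concatenation with `+`) are assumed placeholders rather than a specific library API; the essential points are the two buffers and the β-weighted mixing of each policy-optimization batch.

```python
def mbpo_update_block(env_buffer, model_buffer, ensemble, agent,
                      beta, grad_steps, rollout_len, batch_size=256, n_starts=400):
    """One scheduled update block (illustrative interfaces, not a specific library API)."""
    # Generate synthetic data: k-step rollouts from real starting states, one pass per ensemble member.
    starts = env_buffer.sample(n_starts)
    for model in ensemble:
        model_buffer.add(model.rollout(starts, agent.policy, horizon=rollout_len))

    # Policy optimization on beta-weighted mixtures of real and synthetic transitions.
    n_real = int(round(beta * batch_size))
    for _ in range(grad_steps):
        batch = env_buffer.sample(n_real) + model_buffer.sample(batch_size - n_real)
        agent.update(batch)
```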

5. Empirical Results and Validation

Experimental results on six continuous-control benchmarks (Hopper, Ant, Humanoid; HopperBullet, Walker2dBullet, HalfCheetahBullet) demonstrate:

  • Sample efficiency: AutoMBPO reaches the asymptotic performance of the MBPO baseline using only 50–70% of the real samples.
  • Performance gains: Terminal average returns are 10–30% higher than MBPO with fixed β=0.05.
  • Comparison to model-free counterparts: AutoMBPO retains sample efficiency and avoids severe overfitting found in model-free SAC(20) settings.
  • Ablation studies: Scheduling β alone (AutoMBPO-R) recovers the majority of performance gains versus full AutoMBPO; omitting N₍real₎ or the model loss from the hyper-state degrades performance, supporting the theoretical structure.
  • Controller transfer: A meta-controller trained on one suite of PyBullet tasks generalizes to others with minimal loss.
  • Statistical significance: Paired t-tests on final performance yield p < 0.01 on all tasks, confirming the improvements are robust (Lai et al., 2021).

6. Significance and Implications

The error decomposition and scheduling mechanism of DHMBPO represent a principled approach to managing the bias–variance tradeoff central to model-based RL. Gradually increasing the real data ratio β as experience accumulates is both theoretically optimal and empirically validated. AutoMBPO’s meta-learning formulation enables automatic, context-sensitive discovery of hyperparameter schedules, suggesting a general methodology for robust model-based policy optimization. The predominance of β as the critical hyperparameter is confirmed by extensive ablation, indicating that high-leverage improvements can be achieved without extensive manual tuning of auxiliary parameters.

A plausible implication is that similar meta-scheduling frameworks could optimize other algorithmic schedules where error tradeoffs are dynamic and data-dependent. The verified generalization of the meta-controller across related domains further indicates potential for transferable hyperparameter scheduling strategies in continual learning settings (Lai et al., 2021).
