
AutoMBPO Real-vs-Synthetic Scheduling

Updated 23 March 2026
  • The paper introduces an adaptive hyperparameter scheduling approach that optimally balances real and synthetic data via a learned β, improving sample efficiency in MBPO.
  • The AutoMBPO framework frames hyperparameter tuning as a hyper-MDP solved with PPO, enabling dynamic adjustments of rollout length, policy update count, and β.
  • Empirical results demonstrate that a gradual increase in real data usage significantly reduces model error and boosts performance across various continuous control tasks.

Real-vs-synthetic scheduling in model-based reinforcement learning (MBRL) concerns the determination of the optimal ratio between environment (real) and model-generated (synthetic) samples used for policy optimization. The AutoMBPO framework introduces an adaptive hyperparameter scheduling approach to systematically address the allocation of real versus synthetic experience within the Model-Based Policy Optimization (MBPO) paradigm. Central to this framework is the learned adjustment of the real data ratio β, among other hyperparameters, which is shown to yield improved sample efficiency and policy performance over conventional fixed-schedule baselines. The theoretical analysis establishes the quantitative impact of β on value estimation error, motivating the progressive increase of real data reliance as learning proceeds (Lai et al., 2021).

1. Theoretical Foundation for Real-vs-Synthetic Scheduling

The analysis begins from a generalized Fitted Value Iteration (FVI) perspective, in which, at each Bellman backup, the next state is sampled from the environment with probability β and from a learned model with probability 1 − β. Lemma 3.1 provides a high-probability finite-sample bound for a single backup, showing that the iteration error is controlled by empirical covering number complexity, model error, and the proportion of real experience.

The critical result is Theorem 3.2, the β-mixture FVI bound, which for K FVI iterations and N transitions per batch, states:

$$
\lVert V^* - V^{\pi_K} \rVert_{p,\rho} \leq \frac{2\gamma}{(1-\gamma)^2 C_{\rho,\mu}^{1/p}} d_{p,\mu}(B\mathcal{F}, \mathcal{F}) + O\left( \left(\frac{\beta |\mathcal{A}|}{N_{\mathrm{real}}} \left[\log\left(N_{\mathrm{real}}/(\beta|\mathcal{A}|)\right) + \log(K/\delta)\right]\right)^{1/(2p)}\right) + O\left(\Phi^{-1}\left(1-\frac{\beta\delta}{8KN_{\mathrm{real}}(1-\beta)}\right) \sigma \right) + O\left(\gamma^{K/p} V_{\mathrm{max}}\right)
$$

with $N_{\mathrm{real}} = \beta |\mathcal{A}| N$ the number of real transitions. The second term quantifies the statistical error due to finite real data, increasing with larger β; the third captures model compounding error, which diminishes with larger β. As $N_{\mathrm{real}}$ grows over the course of online training, gradually increasing β simultaneously reduces both errors. This suggests a schedule where β starts small to leverage abundant model data, then increases as real data becomes more available.
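To make the opposing trends concrete, the following minimal sketch evaluates the two β-dependent terms of the bound numerically. It sets every constant hidden by the O(·) notation to 1, which is an assumption made purely for illustration, and the function name `beta_dependent_terms` and the sample values are hypothetical rather than taken from the paper.

```python
import math
from statistics import NormalDist


def beta_dependent_terms(beta, n_real, num_actions, K, delta, sigma, p=1):
    """Return (statistical_term, model_term) with all O(.) constants set to 1.

    Illustrative only: the true constants are not specified in the bound.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")

    # Second term: (beta|A|/N_real * [log(N_real/(beta|A|)) + log(K/delta)])^(1/(2p))
    ratio = beta * num_actions / n_real
    statistical = (ratio * (math.log(1.0 / ratio) + math.log(K / delta))) ** (1.0 / (2 * p))

    # Third term: Phi^{-1}(1 - x) * sigma, with x = beta*delta / (8 K N_real (1 - beta)).
    # The identity Phi^{-1}(1 - x) = -Phi^{-1}(x) avoids rounding 1 - x for tiny x.
    x = beta * delta / (8 * K * n_real * (1.0 - beta))
    if not 0.0 < x < 1.0:
        raise ValueError("tail probability must lie strictly between 0 and 1")
    model = -NormalDist().inv_cdf(x) * sigma

    return statistical, model


if __name__ == "__main__":
    for beta in (0.05, 0.2, 0.5):
        s, m = beta_dependent_terms(beta, n_real=1000, num_actions=4, K=10, delta=0.1, sigma=1.0)
        print(f"beta={beta:.2f}  statistical={s:.4f}  model={m:.4f}")
```

With $N_{\mathrm{real}}$ held fixed, the statistical term grows and the model term shrinks as β increases, which is the trade-off described above.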

2. The AutoMBPO Meta-Framework

AutoMBPO (Automatic Model-Based Policy Optimization) formulates the hyperparameter scheduling (including β, model rollout length k, policy update count G, and retrain decision flags) as a policy learning problem within a hyper-MDP. Each episode of the hyper-MDP is an entire MBPO training run. The hyper-controller, parameterized by π_Ω, acts at discrete intervals (every τ real environment steps) by observing a hyper-state $s_{\mathrm{hyp}}$ and outputting adjustments $a_{\mathrm{hyp}}$ to the hyperparameters.

The observed hyper-state includes:

  • Cumulative real samples $N_{\text{real}}$
  • Model ensemble validation loss $\mathcal{L}_{\text{model}}$
  • Critic loss $\mathcal{L}_Q$
  • Policy shift metric $\epsilon_\pi = \mathbb{E}_{(s,a)}|\pi(a|s) - \pi_{\text{data}}(a|s)|$
  • Average policy return
  • Current values of $(\beta, G, k)$
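The hyper-state listed above can be pictured as a small feature record. The sketch below is one hypothetical layout, not the paper's implementation: the class `HyperState`, its field names, and the helper `policy_shift` are assumptions. The helper estimates $\epsilon_\pi$ by averaging absolute probability differences over sampled state-action pairs.

```python
from dataclasses import dataclass, astuple


@dataclass
class HyperState:
    """Hypothetical container for the quantities the hyper-controller observes."""
    n_real: int          # cumulative real samples
    model_loss: float    # model ensemble validation loss
    critic_loss: float   # critic loss
    policy_shift: float  # epsilon_pi estimate
    avg_return: float    # average policy return
    beta: float          # current real data ratio
    G: int               # current policy update count
    k: int               # current model rollout length

    def as_vector(self):
        """Flatten the fields into a list of floats for a controller network."""
        return [float(x) for x in astuple(self)]


def policy_shift(pi_probs, pi_data_probs):
    """Monte Carlo estimate of E|pi(a|s) - pi_data(a|s)|.

    Both arguments hold probabilities evaluated on the same sampled (s, a) pairs.
    """
    if not pi_probs or len(pi_probs) != len(pi_data_probs):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - q) for p, q in zip(pi_probs, pi_data_probs)) / len(pi_probs)


if __name__ == "__main__":
    eps = policy_shift([0.5, 0.2], [0.3, 0.2])
    state = HyperState(5000, 0.12, 0.8, eps, 1500.0, 0.05, 20, 1)
    print(state.as_vector())
```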

The controller’s possible actions:

  • Scale $\beta$ up, down, or keep fixed (scaling factor $c=1.2$)
  • Toggle model retraining
  • Increment/decrement $G$ (number of policy update steps)
  • Increment/decrement $k$ (model rollout length)
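As a rough illustration of how such actions might update the hyperparameters, the sketch below applies one controller decision to a `(β, G, k)` triple. Only the scaling factor c = 1.2 comes from the description above. The names `HyperParams` and `apply_action`, the step size of 1 for G and k, the clipping of β to at most 1, and the lower bound of 1 on G and k are all assumptions, and the retraining toggle is omitted for brevity.

```python
from dataclasses import dataclass

C = 1.2  # scaling factor for beta, as stated in the text


@dataclass(frozen=True)
class HyperParams:
    beta: float  # real data ratio
    G: int       # policy update count
    k: int       # model rollout length


def apply_action(hp, beta_move=0, g_move=0, k_move=0):
    """Apply one hyper-action; each move is -1 (down), 0 (keep), or +1 (up)."""
    for move in (beta_move, g_move, k_move):
        if move not in (-1, 0, 1):
            raise ValueError("moves must be -1, 0, or +1")
    # Multiply or divide beta by C, then clip so it remains a valid ratio (assumed).
    beta = min(hp.beta * (C ** beta_move), 1.0)
    # Unit steps with a floor of 1 are assumptions, not values from the source.
    G = max(1, hp.G + g_move)
    k = max(1, hp.k + k_move)
    return HyperParams(beta, G, k)


if __name__ == "__main__":
    hp = HyperParams(beta=0.05, G=20, k=1)
    print(apply_action(hp, beta_move=1, g_move=-1, k_move=1))
```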

The controller is trained using proximal policy optimization (PPO), with the reward being the average episode return relative to a baseline MBPO instance with fixed parameters.

3. Scheduled Hyperparameters in Practice

Empirical results report that the learned β schedule, $\beta(t)$, exhibits slow linear growth initially, accelerating later in training. On Hopper and MuJoCo environments, β increases from $0.05$ to approximately $0.3$ mid-training, reaching $0.5$ at termination. On Ant, β rises as high as $0.7$–$0.9$; on Humanoid, nearly $1.0$. In PyBullet tasks, the trend is similar across all environments. AutoMBPO adapts the model rollout length and policy update count concurrently, but the schedule for β remains the most consequential.

Integration into MBPO requires managing two replay buffers: $D_{\mathrm{env}}$ for real environment samples and $D_{\mathrm{model}}$ for synthetic model rollouts. Minibatches for policy training sample a fraction β from $D_{\mathrm{env}}$ and $1-\beta$ from $D_{\mathrm{model}}$, with rollout lengths and update counts as dictated by the controller.
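A minimal sketch of this mixed sampling step follows, assuming both buffers are plain Python lists of transitions and that sampling is uniform with replacement. The function name `sample_mixed_batch` and the rounding of the real-sample count are illustrative choices, not details from the paper.

```python
import random


def sample_mixed_batch(d_env, d_model, batch_size, beta, rng=None):
    """Draw a minibatch with a fraction beta from d_env and the rest from d_model."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    rng = rng or random.Random()
    n_env = round(beta * batch_size)   # number of real transitions
    n_model = batch_size - n_env       # number of synthetic transitions
    # Uniform sampling with replacement from each buffer (an assumption).
    real = rng.choices(d_env, k=n_env) if n_env else []
    synthetic = rng.choices(d_model, k=n_model) if n_model else []
    batch = real + synthetic
    rng.shuffle(batch)                 # avoid ordering real before synthetic
    return batch


if __name__ == "__main__":
    d_env = [("env", i) for i in range(10)]
    d_model = [("model", i) for i in range(100)]
    batch = sample_mixed_batch(d_env, d_model, batch_size=20, beta=0.25, rng=random.Random(0))
    print(sum(1 for source, _ in batch if source == "env"), "real transitions of", len(batch))
```

When the controller changes β, only the `beta` argument needs to change between policy updates; the buffers themselves stay separate.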

4. Empirical Evaluation and Analysis

AutoMBPO demonstrates statistically significant performance improvements ($p<0.05$) over both prior model-based and model-free methods across six continuous control tasks: Hopper, Ant, Humanoid (MuJoCo), and HopperBullet, Walker2dBullet, HalfCheetahBullet (PyBullet). Comparisons include original MBPO (fixed β=5%), SAC with different update ratios, population-based training (PBT), and reinforcement-on-reinforcement (RoR). In early stages, AutoMBPO matches the exploration rate of model-free SAC, subsequently surpassing it via effective synthetic data usage.

Ablation studies indicate that allowing the controller to schedule only β (AutoMBPO-R) retains the bulk of the performance improvement, while scheduling only G or only the model training frequency yields smaller and less consistent gains. This indicates that real-vs-synthetic scheduling is the principal driver of sample efficiency gains in the MBPO setting.

5. Consistency with Theoretical Predictions

The β schedules discovered by AutoMBPO empirically increase monotonically with $N_{\mathrm{real}}$, in alignment with the β-mixture FVI theoretical analysis. The learned rollout schedules roughly match heuristic hand-designed schedules, but the central theoretical implication, a gradual shift from synthetic to real data, is consistently observed in the learned policies. Such robustness supports the theoretical perspective that starting with synthetic data and transitioning to predominantly real experience reduces both model bias and statistical error as learning progresses.

6. Transferability and Robustness Characteristics

Controllers trained by AutoMBPO exhibit task transferability: a hyper-controller trained on one PyBullet environment performs effectively on others without retraining. Experimental results also show that AutoMBPO's outcomes are robust to variations in initial hyperparameter settings $(\beta_0, G_0)$, indicating insensitivity to controller initialization.

7. Summary and Implications

The combination of theoretical bounds and empirical evidence indicates that dynamic scheduling of the real data ratio β (“real-vs-synthetic scheduling”) is essential for optimizing performance in model-based RL. AutoMBPO provides a scalable and transferable solution, automatically learning an effective curriculum for integrating real and synthetic samples within MBPO. This advances the understanding of how to manage the trade-off between model bias and statistical variance, and establishes β-scheduling as a primary hyperparameter for model-based RL optimization (Lai et al., 2021).
