AutoMBPO Real-vs-Synthetic Scheduling
- The paper introduces an adaptive hyperparameter scheduling approach that optimally balances real and synthetic data via a learned β, improving sample efficiency in MBPO.
- The AutoMBPO framework frames hyperparameter tuning as a hyper-MDP solved with PPO, enabling dynamic adjustments of rollout length, policy update count, and β.
- Empirical results demonstrate that a gradual increase in real data usage significantly reduces model error and boosts performance across various continuous control tasks.
Real-vs-synthetic scheduling in model-based reinforcement learning (MBRL) concerns the determination of the optimal ratio between environment (real) and model-generated (synthetic) samples used for policy optimization. The AutoMBPO framework introduces an adaptive hyperparameter scheduling approach to systematically address the allocation of real versus synthetic experience within the Model-Based Policy Optimization (MBPO) paradigm. Central to this framework is the learned adjustment of the real data ratio β, among other hyperparameters, which is shown to yield improved sample efficiency and policy performance over conventional fixed-schedule baselines. The theoretical analysis establishes the quantitative impact of β on value estimation error, motivating the progressive increase of real data reliance as learning proceeds (Lai et al., 2021).
1. Theoretical Foundation for Real-vs-Synthetic Scheduling
The analysis begins from a generalized Fitted Value Iteration (FVI) perspective, in which, at each Bellman backup, the next state is sampled from the environment with probability β and from a learned model with probability 1 − β. Lemma 3.1 provides a high-probability finite-sample bound for a single backup, showing that the iteration error is controlled by empirical covering number complexity, model error, and the proportion of real experience.
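The β-mixture backup described above can be illustrated with a minimal Python sketch. The names `sample_next_state`, `env_step`, and `model_step`, as well as the toy dynamics, are hypothetical stand-ins and are not taken from the paper; the sketch only shows that each next state is drawn from the environment with probability β and from the learned model otherwise.

```python
import random
from typing import Callable, Tuple

State = float
Action = float
StepFn = Callable[[State, Action], State]


def sample_next_state(
    state: State,
    action: Action,
    beta: float,
    env_step: StepFn,
    model_step: StepFn,
    rng: random.Random,
) -> Tuple[State, bool]:
    """Draw the next state from the real environment with probability beta,
    otherwise from the learned model. Returns (next_state, came_from_real)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if rng.random() < beta:
        return env_step(state, action), True
    return model_step(state, action), False


def real_fraction(beta: float, n: int, seed: int = 0) -> float:
    """Empirical share of backups that used the real environment."""
    rng = random.Random(seed)
    env_step = lambda s, a: s + a          # toy "true" dynamics (assumption)
    model_step = lambda s, a: s + 0.9 * a  # toy biased model (assumption)
    hits = sum(
        sample_next_state(0.0, 1.0, beta, env_step, model_step, rng)[1]
        for _ in range(n)
    )
    return hits / n
```

With β = 0 every backup uses the model and with β = 1 every backup uses the environment; intermediate values interpolate between the two in expectation.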
The critical result is Theorem 3.2, the β-mixture FVI bound, stated for K FVI iterations and N transitions per batch. The bound depends on the number of real transitions.
The second term quantifies the statistical error due to finite real data, which increases with larger β; the third captures the model compounding error, which diminishes with larger β. As the number of real transitions grows over the course of online training, gradually increasing β can reduce both errors together: more real data offsets the statistical cost of a larger β, while the larger β lowers the model error. This suggests a schedule where β starts small to leverage abundant model data, then increases as real data becomes more available.
2. The AutoMBPO Meta-Framework
AutoMBPO (Automatic Model-Based Policy Optimization) formulates the hyperparameter scheduling (including β, model rollout length k, policy update count G, and retrain decision flags) as a policy learning problem within a hyper-MDP. Each episode of the hyper-MDP is an entire MBPO training run. The hyper-controller, parameterized by π_Ω, acts at discrete intervals (every τ real environment steps) by observing a hyper-state and outputting adjustments to the hyperparameters.
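The control flow of one hyper-MDP episode might be sketched as follows. This is an illustrative skeleton under stated assumptions, not the authors' implementation: `run_hyper_episode`, `controller`, and `observe` are hypothetical names, and the actual MBPO work (environment steps, model training, policy updates) is left as a comment.

```python
from typing import Callable, Dict, List

Hyperparams = Dict[str, float]
HyperState = Dict[str, float]


def run_hyper_episode(
    total_real_steps: int,
    tau: int,
    hyperparams: Hyperparams,
    controller: Callable[[HyperState, Hyperparams], Hyperparams],
    observe: Callable[[int], HyperState],
) -> List[Hyperparams]:
    """One hyper-MDP episode corresponds to one full MBPO training run.

    The controller is queried every `tau` real environment steps.
    Returns the hyperparameters chosen at each decision point."""
    if tau <= 0:
        raise ValueError("tau must be a positive number of real steps")
    history: List[Hyperparams] = []
    for step in range(1, total_real_steps + 1):
        # A real run would take one environment step here and perform the
        # MBPO model and policy updates using the current hyperparameters.
        if step % tau == 0:
            hyper_state = observe(step)
            # Pass a copy so the controller cannot mutate past settings.
            hyperparams = controller(hyper_state, dict(hyperparams))
            history.append(dict(hyperparams))
    return history
```

For example, a run of 1000 real steps with τ = 250 yields four controller decisions.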
The observed hyper-state includes:
- Cumulative real samples
- Model ensemble validation loss
- Critic loss
- Policy shift metric
- Average policy return
- Current values of the scheduled hyperparameters (β, k, G)
The controller’s possible actions:
- Scale β up, down, or keep it fixed (by a scaling factor)
- Toggle model retraining
- Increment/decrement G (number of policy update steps)
- Increment/decrement k (model rollout length)
The controller is trained using proximal policy optimization (PPO), with the reward being the average episode return relative to a baseline MBPO instance with fixed parameters.
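The controller's action set listed above could be applied to the hyperparameters roughly as follows. This is a sketch under assumptions: the class and function names are hypothetical, the default `scale=1.2` is a placeholder (the source does not state the scaling factor here), and the clamping ranges and the flag-flip reading of "toggle" are illustrative choices.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Hyperparams:
    beta: float          # real data ratio
    k: int               # model rollout length
    G: int               # number of policy update steps
    retrain_model: bool  # whether the model is retrained


@dataclass(frozen=True)
class HyperAction:
    beta_move: int        # -1 scale down, 0 keep, +1 scale up
    k_delta: int          # -1, 0, or +1
    G_delta: int          # -1, 0, or +1
    toggle_retrain: bool  # flip the retraining flag


def apply_action(hp: Hyperparams, action: HyperAction, scale: float = 1.2) -> Hyperparams:
    """Return the hyperparameters after one controller action.

    `scale` is a placeholder value, not the paper's setting."""
    for move in (action.beta_move, action.k_delta, action.G_delta):
        if move not in (-1, 0, 1):
            raise ValueError("moves must be -1, 0, or +1")
    if scale <= 1.0:
        raise ValueError("scale must exceed 1")
    # Multiplicative update for beta, clamped to a valid probability.
    beta = min(1.0, max(0.0, hp.beta * scale ** action.beta_move))
    retrain = (not hp.retrain_model) if action.toggle_retrain else hp.retrain_model
    return replace(
        hp,
        beta=beta,
        k=max(1, hp.k + action.k_delta),  # keep at least one rollout step
        G=max(1, hp.G + action.G_delta),  # keep at least one policy update
        retrain_model=retrain,
    )
```

Keeping the hyperparameters in an immutable dataclass makes each controller decision an explicit state transition, which is convenient when logging the learned schedule.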
3. Scheduled Hyperparameters in Practice
Empirical results report that the learned β schedule exhibits slow linear growth initially, accelerating later in training. On Hopper (MuJoCo), β increases from $0.05$ to approximately $0.3$ mid-training, reaching $0.5$ at termination. On Ant, β rises as high as $0.7$–$0.9$; on Humanoid, nearly $1.0$. In PyBullet tasks, the trend is similar across all environments. AutoMBPO adapts the model rollout length and policy update count concurrently, but the schedule for β remains the most consequential.
Integration into MBPO requires managing two replay buffers: one for real environment samples and one for synthetic model rollouts. Minibatches for policy training draw a fraction β from the real buffer and 1 − β from the synthetic buffer, with rollout lengths and update counts as dictated by the controller.
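A minimal sketch of the mixed minibatch draw, assuming plain Python lists as buffers and uniform sampling with replacement (both are illustrative assumptions; `sample_mixed_batch` is a hypothetical helper, not an MBPO API):

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_mixed_batch(
    real_buffer: Sequence[T],
    model_buffer: Sequence[T],
    batch_size: int,
    beta: float,
    rng: random.Random,
) -> Tuple[List[T], List[T]]:
    """Draw round(beta * batch_size) real transitions and fill the rest of
    the batch with synthetic ones. Sampling is uniform with replacement."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    n_real = int(round(beta * batch_size))
    n_model = batch_size - n_real
    if n_real > 0 and len(real_buffer) == 0:
        raise ValueError("real buffer is empty")
    if n_model > 0 and len(model_buffer) == 0:
        raise ValueError("model buffer is empty")
    real_part = [rng.choice(real_buffer) for _ in range(n_real)]
    model_part = [rng.choice(model_buffer) for _ in range(n_model)]
    return real_part, model_part
```

With a batch size of 256 and β = 0.05, this draws 13 real and 243 synthetic transitions; as the controller raises β, the split shifts toward the real buffer.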
4. Empirical Evaluation and Analysis
AutoMBPO demonstrates statistically significant performance improvements over both prior model-based and model-free methods across six continuous control tasks: Hopper, Ant, Humanoid (MuJoCo), and HopperBullet, Walker2dBullet, HalfCheetahBullet (PyBullet). Comparisons include original MBPO (fixed β=5%), SAC with different update ratios, population-based training (PBT), and reinforcement-on-reinforcement (RoR). In early stages, AutoMBPO matches the exploration rate of model-free SAC, subsequently surpassing it via effective synthetic data usage.
Ablation studies indicate that allowing the controller to schedule only β (AutoMBPO-R) retains the bulk of the performance improvement, while scheduling only G or only the model training frequency yields smaller and inconsistent gains. This provides strong evidence that real-vs-synthetic scheduling is the principal driver of sample efficiency gains in the MBPO setting.
5. Consistency with Theoretical Predictions
The β schedules discovered by AutoMBPO increase monotonically as real data accumulates, in alignment with the β-mixture FVI theoretical analysis. The learned rollout schedules roughly match heuristic hand-designed schedules, and the central theoretical implication, a gradual shift from synthetic to real data, is consistently observed in the learned policies. This consistency supports the theoretical perspective that starting with synthetic data and transitioning to predominantly real experience reduces both model bias and statistical error as learning progresses.
6. Transferability and Robustness Characteristics
Controllers trained by AutoMBPO exhibit task transferability: a hyper-controller trained on one PyBullet environment performs effectively on others without retraining. Experimental results also show that AutoMBPO's outcomes are robust to variations in the initial hyperparameter settings, indicating that performance is insensitive to the starting values the controller adjusts from.
7. Summary and Implications
The combination of theoretical bounds and empirical evidence demonstrates that dynamic scheduling of the real data ratio β—“real-vs-synthetic scheduling”—is essential for optimizing performance in model-based RL. AutoMBPO provides a scalable and transferable solution, automatically learning an effective curriculum for integrating real and synthetic samples within MBPO. This advances the understanding of how to mitigate model-bias and statistical variance trade-offs, and establishes β-scheduling as a primary hyperparameter for model-based RL optimization (Lai et al., 2021).