Double Horizon Model-Based Policy Optimization
- The paper's main contribution is decomposing synthetic rollouts into a long distribution rollout (DR) and a short training rollout (TR) to resolve competing bias-variance challenges in model-based RL.
- It employs a two-horizon approach that improves state distribution alignment while stabilizing gradient estimation for efficient actor-critic updates.
- Empirical evaluations on continuous control benchmarks demonstrate that DHMBPO outperforms state-of-the-art methods in sample efficiency and runtime performance.
Double Horizon Model-Based Policy Optimization (DHMBPO) is a framework for model-based reinforcement learning (MBRL) that decomposes the synthetic trajectory generation process into two distinct rollouts, each with a different horizon, to balance distribution shift, model bias, and gradient variance. By separating the generation of on-policy state distributions from the computation of value gradients, DHMBPO resolves conflicting requirements on rollout length that are fundamental to MBRL. This approach achieves superior sample efficiency and runtime performance on a variety of continuous control tasks compared to state-of-the-art MBRL methods (Kubo et al., 17 Dec 2025).
1. Motivation and Background
1.1 Rollout-Length Dilemmas in Model-Based RL
A central challenge in MBRL is the selection of model rollout length. Increasing rollout length reduces distribution shift—the difference between the current policy’s state-action occupancy and the data collected previously—but also incurs higher model bias due to compounding errors in trajectories generated by the learned model. Conversely, short rollouts reduce model bias, since predictions stay close to observed states, but the distributional mismatch with current policy trajectories can slow policy optimization.
When using differentiable model-based updates (e.g., stochastic value gradients), a longer rollout reduces value estimation bias but sharply increases the variance of policy gradients, especially due to backpropagation through multiple stochastic model transitions.
1.2 Distribution Rollouts and Training Rollouts
DHMBPO introduces two types of rollouts, each addressing one side of the rollout-length dilemma:
- Distribution Rollout (DR): A long rollout of horizon $H_{DR}$, used solely to sample synthetic on-policy states and transitions from the model. No differentiation is performed through the DR. The resulting buffer approximates the state-action marginal of the current policy, mitigating distribution shift.
- Training Rollout (TR): A short, fully differentiable rollout of horizon $H_{TR}$, starting from a DR-generated synthetic on-policy state. The TR enables accurate value-gradient computation for actor-critic updates while keeping gradient variance manageable. Model-based value expansion (MVE) is used to form value targets during the TR.
This two-horizon approach allows separate tuning of the model-based buffer’s distributional alignment and the stability of gradient-based optimization.
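The split can be illustrated with a toy one-dimensional model. Everything here (`model_step`, `policy`, the reward, the horizons) is an illustrative stand-in under assumed linear-Gaussian dynamics, not the paper's implementation:

```python
# Toy sketch of the DR/TR decomposition on a 1-D linear-Gaussian model.
import numpy as np

rng = np.random.default_rng(0)
H_DR, H_TR = 20, 5  # long distribution horizon, short training horizon

def model_step(s, a):
    # Stand-in for the learned dynamics: linear drift plus Gaussian noise.
    return 0.9 * s + 0.1 * a + 0.01 * rng.normal()

def policy(s, theta=0.5):
    return -theta * s

def distribution_rollout(s0):
    """Long, non-differentiable rollout: only collects synthetic states."""
    states, s = [], s0
    for _ in range(H_DR):
        s = model_step(s, policy(s))
        states.append(s)
    return states  # these populate the model buffer

def training_rollout(s0):
    """Short rollout from a DR-sampled state; in DHMBPO this part
    is differentiated through for the actor update."""
    s, ret = s0, 0.0
    for k in range(H_TR):
        a = policy(s)
        ret += (0.99 ** k) * -(s * s + 0.01 * a * a)  # toy quadratic cost
        s = model_step(s, a)
    return ret

buffer = distribution_rollout(1.0)
value = training_rollout(buffer[-1])
print(len(buffer), H_TR)
```

The key design point mirrors the text: the DR only produces start states (no gradients flow through it), while the TR is the only segment ever differentiated.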
2. Algorithmic Formulation
2.1 Reinforcement-Learning Objective and Architecture
DHMBPO operates in a Markov decision process (MDP) with discounted return

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$$

with an actor-critic architecture (policy $\pi_\phi$, critic $Q_\psi$) and a learned dynamics-reward model $\hat{f}_\theta$. The critic is fit to combined real and model-generated data. The policy is optimized by maximizing value estimates produced via short differentiable rollouts.
Surrogate Losses
- Critic Loss: Mean-squared error to MVE-based targets computed with a TR of length $H_{TR}$ from synthetic on-policy states generated by the DR:

$$y = \sum_{k=0}^{H_{TR}-1} \gamma^{k}\, \hat{r}_k + \gamma^{H_{TR}}\, \bar{Q}(s_{H_{TR}}, a_{H_{TR}}), \qquad \mathcal{L}(\psi) = \mathbb{E}\big[(Q_\psi(s_0, a_0) - y)^2\big].$$

- Actor Update: Maximize the expected MVE value, with gradients backpropagated through the differentiable model steps of the TR:

$$\max_{\phi}\; \mathbb{E}\left[\sum_{k=0}^{H_{TR}-1} \gamma^{k}\, \hat{r}_k + \gamma^{H_{TR}}\, Q_\psi\big(s_{H_{TR}}, \pi_\phi(s_{H_{TR}})\big)\right].$$
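A minimal sketch of the $H_{TR}$-step MVE target described above, assuming scalar rewards and a single bootstrap critic value (`mve_target` and `bootstrap_q` are illustrative names, not from the paper):

```python
# H_TR-step model-based value expansion (MVE) target.
gamma, H_TR = 0.99, 5

def mve_target(rewards, bootstrap_q):
    """y = sum_k gamma^k * r_k  +  gamma^H * Q(s_H, a_H),
    where H = len(rewards) model-predicted rewards are used
    before bootstrapping on the critic."""
    y = sum(gamma ** k * r for k, r in enumerate(rewards))
    return y + gamma ** len(rewards) * bootstrap_q

rewards = [1.0] * H_TR  # rewards predicted along the training rollout
target = mve_target(rewards, bootstrap_q=10.0)
print(target)
```

With `rewards` empty the target degenerates to the purely model-free bootstrap, which is exactly the $H_{TR}=0$ limit discussed in the ablations.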
2.2 DHMBPO Two-Horizon Protocol
- $H_{DR}$: Long horizon for the DR (e.g., 20 steps); improves coverage of the on-policy state-action distribution.
- $H_{TR}$: Short horizon for the TR (e.g., 5 steps); controls gradient variance by differentiating through a limited number of model transitions.
DHMBPO Pseudocode (Condensed)
| Step | Description |
|---|---|
| Real data collection | Interact with the environment using $\pi$; store transitions in replay buffer $\mathcal{D}$ |
| Model fitting | Update the dynamics-reward model $\hat{f}$ on samples from $\mathcal{D}$ |
| DR rollouts | Generate $H_{DR}$-step synthetic trajectories; store in model buffer $\mathcal{D}_{model}$ |
| Policy-critic updates (UTD loop) | For each update mini-batch from $\mathcal{D}_{model}$, perform an $H_{TR}$-step differentiable TR, then update the critic and actor |
| Buffer management | Clear $\mathcal{D}_{model}$ after each epoch |
Typical parameters: $H_{DR} = 20$, $H_{TR} = 5$, batch size 256, update-to-data ratio 1 (Kubo et al., 17 Dec 2025).
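The steps in the table can be condensed into a skeleton loop. Every component here is a stub standing in for the real networks and buffers; names and the epoch length are illustrative:

```python
# Skeleton of the DHMBPO training loop; all components are stubs.
import random

def collect_real_step():            return ("s", "a", 0.0, "s_next")
def fit_model(batch):               pass          # model regression step
def dr_rollout(s, H=20):            return [(s, i) for i in range(H)]
def update_actor_critic(s0, H=5):   pass          # TR-based critic/actor update

D_real, D_model = [], []   # real replay buffer, synthetic model buffer
UTD = 1                    # update-to-data ratio

for _ in range(10):                          # one epoch of interaction
    D_real.append(collect_real_step())       # 1) real data collection
    fit_model(random.sample(D_real, 1))      # 2) model fitting
    D_model.extend(dr_rollout(D_real[-1][0]))  # 3) DR rollout -> model buffer
    for _ in range(UTD):                     # 4) UTD loop of TR updates
        start_state = random.choice(D_model)
        update_actor_critic(start_state)
D_model.clear()                              # 5) buffer cleared each epoch
print(len(D_real), len(D_model))
```

The clearing step matters: because the model buffer is regenerated by fresh DR rollouts, it tracks the current policy's distribution rather than accumulating stale synthetic data.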
3. Theoretical Insights
The two-horizon framework in DHMBPO is rooted in the decomposition of three sources of error:
- Distribution-Shift Error: Sampling start states directly from the replay buffer $\mathcal{D}$ incurs significant off-policy bias. A longer DR horizon $H_{DR}$ better approximates the true on-policy occupancy, at a model-bias cost that grows with $H_{DR}$ and the one-step model error $\epsilon_m$.
- Value-Estimation Bias: A longer TR reduces bias by using more model-predicted rewards and deferring reliance on the model-free critic. This bias decays as $\gamma^{H_{TR}}$, but an excessively large $H_{TR}$ is impractical due to instability.
- Gradient Variance: The variance of the value gradient through differentiable rollouts increases rapidly with $H_{TR}$ (exponentially in the worst case, owing to chaotic model dynamics). Empirical observations indicate instability for large $H_{TR}$.
The separation of DR and TR allows independent control of these error sources: DR is set long to ensure distributional fidelity; TR is kept short for gradient stability, providing a practical resolution to the rollout dilemma.
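The growth of gradient variance with the differentiation horizon can be seen in a toy linear-Gaussian system, where the pathwise derivative is accumulated exactly through a simple recursion. This is an illustrative construction under assumed dynamics, not an analysis from the paper:

```python
# Pathwise-gradient variance vs. rollout horizon in a toy system:
#   x_{t+1} = c * x_t + sigma * eps_t,  with c = a + theta,
# so d x_{t+1}/d theta = x_t + c * d x_t/d theta.
import numpy as np

rng = np.random.default_rng(0)

def pathwise_grad(H, c=1.2, sigma=0.1, x0=1.0):
    """Exact derivative of x_H w.r.t. the policy gain theta."""
    x, dx = x0, 0.0
    for _ in range(H):
        # Update state and its theta-derivative simultaneously.
        x, dx = c * x + sigma * rng.normal(), x + c * dx
    return dx

def grad_var(H, n=2000):
    return np.var([pathwise_grad(H) for _ in range(n)])

v_short, v_long = grad_var(2), grad_var(10)
print(v_short < v_long)  # variance grows with the differentiation horizon
```

With an expanding map ($|c| > 1$), noise injected at each step is amplified through every subsequent derivative update, which is the mechanism behind keeping $H_{TR}$ short.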
4. Experimental Evaluation
4.1 Benchmarks and Hyperparameters
DHMBPO was evaluated across two major continuous control suites:
- OpenAI Gym (MuJoCo): Ant, HalfCheetah, Hopper, Humanoid, Walker2d
- DeepMind Control Suite (DMC): 18 tasks spanning different morphologies and dynamics
Common hyperparameters for all experiments included the discount factor $\gamma$, batch size 256, DR horizon $H_{DR} = 20$, TR horizon $H_{TR} = 5$, replay buffer size 1 million, model ensemble size 8, critic ensemble size 5, and separate learning rates for the actor/critic and the model.
4.2 Ablation Studies on Horizon Lengths
Three regimes were compared:
| Variant | DR Horizon ($H_{DR}$) | TR Horizon ($H_{TR}$) | Observed Behavior |
|---|---|---|---|
| MBPO (DR-only) | 20 | 0 | Suffers high model-bias in value function; fast runtime |
| SVG (TR-only) | 0 | 5 | Incurs off-policy bias; fails to recover on-policy states |
| DHMBPO | 20 | 5 | Outperforms both, with synergy between low bias and low variance |
Best aggregate performance was attained at $H_{DR} = 20$, $H_{TR} = 5$. For longer $H_{TR}$, gradient variance became prohibitive; for shorter $H_{TR}$, value-estimation bias was excessive.
4.3 Comparison to State-of-the-Art
DHMBPO demonstrated strong sample efficiency and runtime advantages:
- Gym tasks (to 500 K steps):
- Reached high returns faster than MBPO, SAC-SVG(H), and MACURA.
- For Ant: 3.6h (DHMBPO), 5.2h (SAC-SVG(H)), 58.3h (MACURA). Mean runtime ratio: 1.0 (DHMBPO), 1.6 (SAC-SVG(H)), 16.8 (MACURA).
- DMC tasks (to 250 K or 500 K steps):
- Outperformed TD-MPC2 and Dreamer v3 in IQM and median return, at matched or lower runtime.
5. Practical Recommendations and Limitations
5.1 Horizon Selection and Implementation Guidelines
- DR horizon ($H_{DR}$): Set moderately long (e.g., 20) so that the model buffer effectively approximates the on-policy state distribution without excessive model bias.
- TR horizon ($H_{TR}$): Kept short (≈5) to prevent gradient-variance explosion while still reducing value-estimation bias relative to purely model-free targets.
Intermediate validation can be performed by measuring, on offline replay-buffer states, the trade-off between the bias of the $H_{TR}$-step value estimate and its sample-average standard error, choosing the largest $H_{TR}$ before variance dominates.
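One way to sketch that validation procedure, under the assumption of a toy estimator whose bias shrinks geometrically with the horizon while per-step noise compounds (all names and constants are hypothetical):

```python
# Pick the largest TR horizon whose standard error stays within budget.
import numpy as np

rng = np.random.default_rng(1)

def tr_estimate(H, noise=0.05, bias0=1.0):
    """Toy H-step value estimate: critic bias decays ~ gamma^H,
    while noise compounds with each differentiated model step."""
    gamma = 0.9
    bias = bias0 * gamma ** H
    return bias + noise * np.sqrt(2.0 ** H) * rng.normal()

def choose_horizon(max_H=10, n=500, se_budget=0.01):
    """Largest H whose sample standard error is under the budget."""
    best = 1
    for H in range(1, max_H + 1):
        se = np.std([tr_estimate(H) for _ in range(n)]) / np.sqrt(n)
        if se <= se_budget:
            best = H
    return best

H_star = choose_horizon()
print(H_star)
```

The selection rule is deliberately one-sided: bias only improves with longer $H$, so the binding constraint is the variance budget, matching the "largest $H_{TR}$ before variance dominates" criterion.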
5.2 Limitations and Open Problems
- Exploration: DHMBPO does not introduce any mechanism for directed exploration beyond the current policy distribution. The DR step imitates current policy-induced distributions and does not encourage visiting novel states.
- Deceptive or Sparse-Reward Tasks: In tasks with local optima or sparse rewards (e.g., finger-spin), DHMBPO can stagnate. Augmenting exploration through auxiliary objectives or alternative criteria remains an open direction.
- Theory of Bias/Variance Tradeoff: While the two-horizon approach is operationally justified and empirically validated, a complete theoretical characterization of bias and variance as functions of $H_{DR}$ and $H_{TR}$ remains unresolved.
A plausible implication is that the general principle of decomposing model rollouts by distribution fitting and value expansion is extensible to other MBRL architectures that decouple sampling and optimization (Kubo et al., 17 Dec 2025).
6. Relationship to Bidirectional MBPO and Related Work
DHMBPO generalizes the concept of error-compounding control seen in Bidirectional Model-based Policy Optimization (BMPO) (Lai et al., 2020), which simultaneously uses forward and backward model ensembles and splits rollout branches to reduce maximum compounding steps per model. BMPO achieves a provably tighter bound on return discrepancy via this branched rollout structure. In contrast, DHMBPO uses a distribution-training separation to resolve a different pair of algorithmic tradeoffs—distribution shift and gradient variance—by decoupling the state-distribution-sampling function (DR) from the gradient-estimation function (TR).
7. Summary Table: Role of Horizons in DHMBPO
| Component | Horizon | Typical Value | Primary Function |
|---|---|---|---|
| Distribution Rollout (DR) | $H_{DR}$ | 20 | Approximates the on-policy state-action marginal for off-policy actor-critic updates |
| Training Rollout (TR) | $H_{TR}$ | 5 | Differentiable rollout for value-gradient estimation; keeps gradient variance low |
Rigorous ablation and benchmark results indicate that the interplay between these horizons is critical to the method’s empirical performance, enabling DHMBPO to achieve a favorable balance among sample efficiency, return performance, and runtime cost (Kubo et al., 17 Dec 2025).