On Training in Imagination

Published 7 May 2026 in cs.LG | (2605.06732v1)

Abstract: State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and are scored by a learned reward model, but without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, and derive the optimal sample allocation--the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. (2026). Second, we examine how policy optimization with REINFORCE tolerates noisy rewards, which are often cheaper to obtain. We show that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of rollouts. This introduces a practical tradeoff: given a fixed budget, should one buy more rollouts with cheaper but noisier rewards, or fewer rollouts with more expensive but less noisy rewards? We reduce this choice to a one-dimensional optimization problem and characterize the optimum.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper establishes a rigorous decomposition of return error by attributing policy suboptimality to errors in learned dynamics and reward models via Lipschitz constants.
It demonstrates that reducing Lipschitz constants through techniques like spectral normalization tightens error bounds, enhancing long-horizon policy performance.
It provides an optimal sample allocation formula under power-law error scaling, balancing reward fidelity and rollout quantity to mitigate noise and bias.

Training in Imagination: Error Attribution, Sample Allocation, and Reward Fidelity in Model-Based RL

Introduction

The paradigm of "training in imagination," wherein agent policies are optimized exclusively on trajectories simulated by learned dynamics and reward models, has gained increasing prominence in model-based RL. Prominent recent systems such as Dreamer3 and Dreamer4 instantiate this approach, leveraging learned world models to facilitate efficient and scalable control with minimal environment interaction. Despite clear empirical progress, theoretical understanding of the interplay between dynamics and reward model errors—and their implications for sample complexity, policy performance, and annotation budget allocation—has lagged. This work provides a rigorous quantitative analysis of error propagation and sample allocation in model-based RL with learned reward models, delivering formal decompositions, sample complexity results, and guidelines for budget allocation under realistic power-law error scaling and reward noise.

Decomposition of Return Error and Attribution

The central technical contribution is an extension of Lipschitz-based simulation-lemma-style bounds to settings where both the dynamics ( $\hat{f}$ ) and the reward model ( $\hat{r}$ ) are learned. The derived error decomposition establishes a bound on the policy return gap between the imagined and true environments that separates contributions from dynamics and reward model errors, each modulated by explicit, independently-controlled coefficients parameterized by their respective Lipschitz constants. Formally, under contraction conditions ( $\gamma L_f(1+L_\pi)<1$ ), the following bound for a fixed Lipschitz policy $\pi$ is established:

$| J(\pi, \mathcal{M}) - J(\pi, \hat{\mathcal{M}}) | \leq \frac{1}{1-\gamma} \, \varepsilon_r + \frac{\gamma L_r (1 + L_\pi)}{(1-\gamma)(1-\gamma L_f(1+L_\pi))} \, \varepsilon_f$

where $\varepsilon_f$ and $\varepsilon_r$ are the worst-case $\ell_2$ dynamics and reward model errors, and $L_f, L_r, L_\pi$ are the Lipschitz constants of the learned dynamics, reward, and policy, respectively.

This explicitly attributes the return suboptimality to the respective error sources and provides a framework for independent control and mitigation.

Figure 1: Dynamics and reward errors exhibit distinct power-law scaling with dataset size; reward model error typically decays $\approx9\times$ faster with increasing data—crucial for sample allocation.

Representation Desiderata: Tightening the Error Bound

The bound demonstrates that the impact of dynamics errors compounds along the rollout trajectory with a coefficient that is monotone in $\hat{r}$ 0, $\hat{r}$ 1, and $\hat{r}$ 2. Consequently, learned latent representations that reduce these Lipschitz constants directly tighten the value gap bound, motivating regularization schemes such as spectral normalization and objectives like temporal straightening in latent space.

The link between smooth, contractive latent dynamics and tighter error bounds is explicit: lower Lipschitz constants attenuate the exponential accumulation of compounding model errors, particularly in long-horizon tasks. The temporal straightening objective, enforcing parallelism in consecutive latent velocities, is shown to upper-bound incremental rollout curvature by a function of the latent velocity map's Lipschitz constant. Minimizing this regularity yields better long-horizon accuracy in imagination.

Sample Budget Allocation under Power-Law Error Scaling

A key practical question is how to divide a finite annotation and interaction budget between dynamics and reward data, given that reward annotations (e.g., human preferences) are often more expensive and may scale differently with sample size. Empirical analyses confirm that, for standard neural architectures, reward model error decays with training data at rates nearly an order of magnitude faster than dynamics error (see Figure 1). Theoretically, if the error curves scale as $\hat{r}$ 3 and $\hat{r}$ 4 for $\hat{r}$ 5 samples, the optimal allocation ratio is

$\hat{r}$ 6

where $\hat{r}$ 7 and $\hat{r}$ 8 are per-sample costs. This formula quantifies how the power-law exponents, cost ratios, and error coefficients jointly determine optimal budget distribution; for instance, favorable reward error exponents and lower annotation cost shift the optimal mix toward reward samples.

Figure 2: Agreement between predicted and realized sample allocation ratios when leveraging value-function sensitivities, demonstrating practical accuracy of the optimal sample split formula.

Figure 3: The derived decomposed return error bound holds empirically, but with notable looseness when instantiating global Lipschitz constants.

Figure 4: The multiplier in the optimal sample ratio formula, when computed with practical Lipschitz constants, can overshoot realized ratios by orders of magnitude; local value-function sensitivities are recommended for more accurate allocation.

Empirically, the proportionality relationship between realized allocation and error ratios holds strongly when value-function sensitivities are used, but the bound is loose—by up to three orders of magnitude—when only global Lipschitz constants are available.

Analysis of Policy Gradient Variance: Reward Noise and Fidelity

In model-based RL with imitation or preference data, reward signals are often subject to stochastic noise or systematic bias. The analysis considers the consequences of noisy reward observations for policy gradient estimation. When noise is zero-mean, the expected gradient is unbiased, but the variance is inflated by an explicit term proportional to per-step noise variance and inversely with the number of rollouts. This gives rise to a fidelity–quantity tradeoff: for a fixed annotation budget, should one prioritize fewer high-fidelity (low-noise) samples, or increase rollout count at the expense of higher per-step noise?

The optimal fidelity resolves as a one-dimensional minimization over the product $\hat{r}$ 9, where $\gamma L_f(1+L_\pi)<1$ 0 is noise variance as a decreasing function of per-rollout annotation cost $\gamma L_f(1+L_\pi)<1$ 1. Analytical cases show that when $\gamma L_f(1+L_\pi)<1$ 2 drops sufficiently fast with $\gamma L_f(1+L_\pi)<1$ 3, it is optimal to buy higher-fidelity data; otherwise, maximizing rollout quantity is preferable. Crucially, if reward noise has an irreducible floor, the optimal strategy is to maximize rollout number regardless of fidelity cost. In contrast, systematic reward biases introduce persistent, nonvanishing gradient bias that trajectory averaging cannot remedy.

Empirical Calibration and Numerical Evaluations

Extensive empirical studies on synthetic dynamics/reward environments and LQG control tasks illustrate several crucial findings:

The decomposed return error bound is always satisfied but is typically conservative by at least an order of magnitude.
Power-law scaling exponents extracted from actual model learning trajectories confirm the marked gap between reward and dynamics learning rates.
Optimal sample ratios derived from the theoretical framework align with observed minimal return error allocations when realized value-function sensitivities are substituted for worst-case Lipschitz constants.

This underscores that practical systems should avoid the use of global Lipschitz overestimates when tuning data acquisition pipelines for model-based RL.

Figure 5: Empirical reward and dynamics model errors versus dataset size, confirming distinct power-law regimes; reward error decays nearly an order of magnitude faster per decade.

Implications and Future Directions

The analysis provides explicit, actionable guidelines for model-based RL practitioners building learning systems around imagined rollouts:

Prioritize model architectures and regularization choices that directly minimize the Lipschitz constants of learned system components to limit long-horizon compounding of errors.
Use empirical power-law scaling exponents to inform data acquisition decisions; reward annotations can often be successful with much less data than dynamics transitions.
Where possible, use realized local value-function sensitivities—not global model Lipschitz constants—to calibrate sample allocations, as the latter grossly overestimate required data.
For reward annotation, adopt fidelity regimes guided by the empirical behavior of annotation noise as a function of cost, maximizing rollout quantity unless high-fidelity yields a superlinear improvement in variance reduction.
Detect and mitigate systematic reward bias early; averaging cannot remove bias-induced gradient drift.

The theoretical machinery presented here forms a foundation for future study of sample-efficient RL under imperfect generative models, especially in offline and human-in-the-loop scenarios. Of particular interest are extensions to stochastic dynamics, unbounded operating regimes, and dynamic weighting of annotation fidelity in nonstationary environments.

Conclusion

This work delivers a comprehensive and rigorous quantitative analysis of policy error decomposition, representation regularity, sample allocation, and fidelity-variance tradeoffs in model-based RL with learned dynamics and reward functions. The derived bounds, optimality results, and empirical findings guide both theoretical understanding and practical implementation. The connection between tight Lipschitz-induced bounds and representation learning objectives (e.g., temporal straightening) establishes a critical link between architectural choices and downstream policy performance. These results provide a toolkit for principled, cost-aware training of high-performing RL agents in imagination, contributing formally to the methodology underpinning efficient RL from limited and noisy data (2605.06732).

Markdown Report Issue