
Scalable and Optimistic MBRL (SOMBRL)

Updated 3 July 2026
  • SOMBRL is an advanced model-based RL approach that injects optimism to improve exploration and sample efficiency in sparse-reward settings.
  • It appears in three formulations (uncertainty-aware modeling, optimistic world models with RBMLE, and noise-augmented MDPs), each aimed at balancing exploration and exploitation.
  • Empirical results show that SOMBRL enhances convergence speed and performance on continuous-control and visual tasks while maintaining scalability with deep architectures.

Scalable and Optimistic Model-Based Reinforcement Learning (SOMBRL) is an approach to efficient exploration in model-based reinforcement learning under unknown dynamics. The agent learns directly from online interactions and applies optimism in the face of uncertainty in a form intended to remain compatible with scalable planners, policy optimizers, and deep world-model architectures. In the literature considered here, SOMBRL appears in at least three closely related formulations: an uncertainty-aware dynamics-model framework with an intrinsic uncertainty bonus, a world-model instantiation based on Reward-Biased Maximum Likelihood Estimation (RBMLE) and an optimistic dynamics loss, and an earlier tractable optimism formulation based on a noise-augmented Markov decision process (Sukhija et al., 25 Nov 2025; Mete et al., 10 Feb 2026; Pacchiano et al., 2020).

1. Problem setting and motivation

Efficient exploration is a central challenge in reinforcement learning, particularly in sparse-reward environments. In model-based RL, the difficulty is acute because the agent must simultaneously identify unknown dynamics and exploit the current model for planning or policy optimization. One formulation writes the system as

$$x_{t+1} = f^*(x_t,u_t) + w_t,$$

with unknown $f^*$ and process noise $w_t$ that is treated as zero-mean Gaussian or sub-Gaussian in the theoretical development. This setting motivates optimism-based methods that prefer actions whose consequences are either high-reward or insufficiently known (Sukhija et al., 25 Nov 2025).

In large world-model systems, the same issue appears through certainty-equivalence training. Architectures such as Dreamer, STORM, and DIAMOND learn a dynamics model by maximum likelihood on real data and then learn a policy in imagination. Because such training treats the current model estimate as “certain”, these systems can get stuck in poorly explored regions; the exposition on Optimistic World Models identifies this as the closed-loop identification problem. Classical OFU and UCB approaches seek exploration through confidence sets, but in deep RL they require nonconvex constraints or explicit uncertainty estimation, which makes them difficult to scale directly (Mete et al., 10 Feb 2026).

A related earlier line of work reinterprets scalable optimistic model-based algorithms as solving a tractable noise-augmented MDP. That perspective emphasizes the basic optimism–estimation trade-off: an optimistic RL algorithm must over-estimate the true value function, but not by so much that estimation error dominates. The same paper reports that, in deep RL, estimation error is significantly more troublesome, even though optimistic model-based methods can match state-of-the-art continuous-control performance when that error is controlled (Pacchiano et al., 2020).

2. Major formulations

The SOMBRL label is applied to distinct but technically connected mechanisms for scalable optimism. The following comparison organizes the principal formulations that appear in the cited literature.

| Formulation | Core mechanism | Representative compatibility |
| --- | --- | --- |
| Uncertainty-aware SOMBRL | Posterior mean $\mu_n$ and uncertainty $\sigma_n$; optimize $r(x,u)+\lambda_n\lVert\sigma_n(x,u)\rVert$ | MBPO, Dreamer, SimFSVGD |
| Optimistic World Models | Augment world-model training with an optimistic dynamics loss based on RBMLE | DreamerV3, STORM |
| Noise-augmented optimism | Add sampled noise to empirical rewards or transitions to form an optimistic MDP | UCBVI-style planning; MBPO-style deep variants |

In the uncertainty-aware formulation, SOMBRL fits either a Gaussian process or a small ensemble of neural networks to transition data and records, for each state–action pair $z=(x,u)$, a mean $\mu_n(z)$ and standard deviation $\sigma_n(z)$. Planning then uses the learned mean dynamics while augmenting extrinsic reward with an intrinsic term proportional to epistemic uncertainty. The paper emphasizes that this preserves the ordinary model-based planning structure: one solves exactly the same planning problem as greedy model-based RL, but with an augmented reward (Sukhija et al., 25 Nov 2025).
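The ensemble variant of this bonus can be illustrated with a minimal sketch. It assumes that each ensemble member is a callable returning a next-state list, and that the bonus is the Euclidean norm of the per-dimension disagreement. The names `ensemble_stats`, `optimistic_reward`, `make_member`, and the toy members themselves are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

def ensemble_stats(models, x, u):
    """Per-dimension mean and standard deviation of the members' predictions."""
    preds = [m(x, u) for m in models]              # one next-state list per member
    dims = range(len(preds[0]))
    mu = [mean([p[d] for p in preds]) for d in dims]
    sigma = [pstdev([p[d] for p in preds]) for d in dims]
    return mu, sigma

def optimistic_reward(reward_fn, models, x, u, lam):
    """Extrinsic reward plus lam times the Euclidean norm of the disagreement."""
    _, sigma = ensemble_stats(models, x, u)
    bonus = sum(s * s for s in sigma) ** 0.5
    return reward_fn(x, u) + lam * bonus

# Hypothetical toy ensemble: members agree near x = 0 and disagree far from it.
def make_member(curvature):
    return lambda x, u: [x[0] + 0.1 * u[0] + curvature * x[0] ** 2]

models = [make_member(c) for c in (-0.2, -0.1, 0.0, 0.1, 0.2)]
reward_fn = lambda x, u: -abs(x[0])

near = optimistic_reward(reward_fn, models, [0.0], [1.0], lam=1.0)  # no bonus
far = optimistic_reward(reward_fn, models, [2.0], [0.0], lam=1.0)   # bonus added
```

The planner itself is untouched in this sketch: only the reward passed to it changes, which is the property the paragraph above attributes to the formulation.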

In the Optimistic World Models formulation, optimism is not injected through an explicit uncertainty bonus. Instead, the model-learning objective itself is biased toward high-reward imagined transitions by adding an optimistic dynamics loss. This is described as a plug-and-play instance of scalable optimistic MBRL within existing world-model architectures, preserving scalability while requiring only minimal modifications to standard training procedures. The paper instantiates this design as Optimistic DreamerV3 and Optimistic STORM (Mete et al., 10 Feb 2026).

The noise-augmented formulation constructs an optimistic MDP by perturbing empirical reward and transition estimates with random noise, then selecting the best sample. In the tabular development, the optimistic reward and transition are formed from a small “model-cloud” of noisy samples. In deep RL, the same work studies ensemble-based optimism and shows that naive “pick the model predicting the highest reward/return” can lead to exploitation of inaccurate models unless additional controls are imposed (Pacchiano et al., 2020).

3. Mathematical objectives

A central objective in uncertainty-aware SOMBRL is the optimistic planning criterion

$$\pi_n = \arg\max_{\pi\in\Pi} \; \mathbb E^{\mu_n,\sigma_n}_{\pi} \left[\sum_{t=0}^{T-1} r(x'_t,u_t) + \lambda_n \cdot \|\sigma_n(x'_t,u_t)\|\right],$$

with shorthand

[…]

Here $\lambda_n$ trades off exploration and exploitation, and the expectation is taken under policy $\pi$ and the mean dynamics plus known noise. In the Gaussian-process case, the posterior mean and variance are written componentwise as

[…]

and these induce a confidence set of plausible dynamics (Sukhija et al., 25 Nov 2025).
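For orientation, the textbook Gaussian-process posterior that componentwise statements of this kind usually refer to is sketched below. The notation is an assumption rather than the paper's own: $k$ is the kernel, $K_n$ the kernel matrix over the $n$ observed inputs, $k_n(z)$ the vector of kernel evaluations between $z$ and those inputs, $y^{j}_{1:n}$ the $j$-th component of the observed next states, and $\sigma_w^2$ the noise variance.

```latex
\mu_{n,j}(z) = k_n(z)^\top \left(K_n + \sigma_w^2 I\right)^{-1} y^{j}_{1:n},
\qquad
\sigma_{n,j}^2(z) = k(z,z) - k_n(z)^\top \left(K_n + \sigma_w^2 I\right)^{-1} k_n(z).
```

With a shared kernel, the variance term does not depend on the output component $j$; only the mean does.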

In the world-model formulation, the dynamics model and the policy carry separate parameters, and the standard likelihood component on real transitions is written as

[…]

SOMBRL augments this with an optimistic dynamics loss computed on imagined trajectories:

[…]

where

[…]

The total world-model loss is

[…]

The exposition describes this term as pushing the log-probabilities of imagined transitions toward outcomes that yield high imagined return, in a fully gradient-based, uncertainty-free form (Mete et al., 10 Feb 2026).
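One way to picture a loss that pushes log-probabilities of imagined transitions toward high-return outcomes is a score-function term, minus the advantage times the log-probability, with the advantage held constant. The sketch below assumes that form for a single categorical transition distribution. The form, the names `optimistic_loss_and_grad` and `optimistic_step`, and the sample values are illustrative assumptions, not the paper's definition.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def optimistic_loss_and_grad(logits, samples):
    """samples: (next_state_index, advantage) pairs from imagined rollouts.

    Loss = -mean(advantage * log p(next_state)); the advantage is treated as a
    constant (stop-gradient), so the gradient flows only through log p.
    """
    probs = softmax(logits)
    n = len(samples)
    loss = -sum(adv * math.log(probs[k]) for k, adv in samples) / n
    grad = [0.0] * len(logits)
    for k, adv in samples:
        for j in range(len(logits)):
            indicator = 1.0 if j == k else 0.0
            # d/dlogit_j of (-adv * log p_k) = -adv * (indicator - p_j)
            grad[j] += -adv * (indicator - probs[j]) / n
    return loss, grad

def optimistic_step(logits, samples, lr):
    """One plain gradient-descent step on the optimistic term alone."""
    _, grad = optimistic_loss_and_grad(logits, samples)
    return [l - lr * g for l, g in zip(logits, grad)]

# Hypothetical imagined samples: outcome 2 had positive advantage, outcome 0 negative.
samples = [(0, -1.0), (2, 2.0)]
new_logits = optimistic_step([0.0, 0.0, 0.0], samples, lr=0.5)
new_probs = softmax(new_logits)  # mass shifts toward outcome 2 and away from 0
```

In a full world model this term would be added to the likelihood loss on real transitions, which keeps the model anchored to data; the sketch isolates the optimistic term only.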

The same world-model exposition also places this loss within RBMLE. Standard MLE uses

[…]

whereas RBMLE inserts reward bias through

[…]

The text further states that UCB can be viewed as a constrained problem and RBMLE as the Lagrangian dual,

[…]

respectively (Mete et al., 10 Feb 2026).
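A generic way to write the three estimators being contrasted is sketched below. All symbols are assumptions introduced for illustration: $\theta$ denotes model parameters, $\ell_t(\theta)$ the log-likelihood of the data observed up to time $t$, $J^*(\theta)$ the optimal value under model $\theta$, $\alpha_t$ a reward-bias weight, and $c_t$ a confidence radius.

```latex
\hat\theta_t^{\mathrm{MLE}} \in \arg\max_{\theta}\; \ell_t(\theta),
\qquad
\hat\theta_t^{\mathrm{RBMLE}} \in \arg\max_{\theta}\; \bigl\{ \ell_t(\theta) + \alpha_t\, J^*(\theta) \bigr\},

\text{UCB-style:}\quad \max_{\theta}\; J^*(\theta)
\quad \text{s.t.} \quad \ell_t(\theta) \;\ge\; \max_{\theta'} \ell_t(\theta') - c_t .
```

Under this notation, attaching a multiplier $\eta > 0$ to the constraint gives the objective $J^*(\theta) + \eta\,\ell_t(\theta)$ up to a constant, whose maximizer coincides with the reward-biased form for $\alpha_t = 1/\eta$. That is the sense in which the reward-biased objective can be read as a Lagrangian relaxation of the constrained problem.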

In the noise-augmented MDP formulation, optimism is produced by sampling noisy rewards or transitions around empirical estimates. For rewards,

[…]

and for transitions,

[…]

with Gaussian noise in the analyzed variant. These are collapsed into a single optimistic reward and transition by maximizing across samples, thereby turning a single empirical MDP into a small “model-cloud” whose best member supplies the optimism (Pacchiano et al., 2020).
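The reward side of this construction can be sketched as follows. The noise schedule, which shrinks with the visit count as one over its square root, and the names `optimistic_reward` and `average_bonus` are assumptions made for illustration; the paper's exact scaling is not reproduced here.

```python
import random

def optimistic_reward(r_hat, count, n_samples, noise_scale, rng):
    """Max over a small cloud of Gaussian-perturbed copies of the empirical reward.

    Assumed schedule: the perturbation standard deviation decays as
    noise_scale / sqrt(count), so rarely visited pairs receive more optimism.
    """
    std = noise_scale / max(count, 1) ** 0.5
    cloud = [r_hat + rng.gauss(0.0, std) for _ in range(n_samples)]
    return max(cloud)

def average_bonus(count, trials=2000, seed=0):
    """Average optimism added to a zero empirical reward at a given visit count."""
    rng = random.Random(seed)
    draws = [optimistic_reward(0.0, count, 5, 1.0, rng) for _ in range(trials)]
    return sum(draws) / trials

rare = average_bonus(count=1)      # large average bonus
common = average_bonus(count=100)  # much smaller average bonus
```

Taking the maximum over several noisy samples is what makes the estimate optimistic on average; a single unbiased perturbation would not be.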

4. Algorithmic structure and scalability

The uncertainty-aware SOMBRL workflow is episode-based. At each episode $n$, the method fits or updates an uncertainty-aware model on collected transitions, computes posterior mean $\mu_n$ and standard deviation $\sigma_n$, sets an exploration weight $\lambda_n$, defines the optimistic reward

$$r(x,u) + \lambda_n \|\sigma_n(x,u)\|,$$

plans or trains a policy $\pi_n$ to maximize this optimistic reward under the learned mean dynamics, and then executes $\pi_n$ on the real system to collect more data. The paper states that any model-based planner, model-predictive controller, or policy optimizer applies, and explicitly names MBPO, Dreamer, and SimFSVGD as compatible examples (Sukhija et al., 25 Nov 2025).
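The episode loop can be sketched end to end on a toy problem. Everything concrete here is an assumption made for illustration: a scalar linear system, a bootstrapped least-squares ensemble standing in for the uncertainty-aware model, and a random-shooting planner standing in for MBPO or Dreamer. None of it is the paper's implementation.

```python
import random
from statistics import mean, pstdev

def true_step(x, u, rng):
    # Hypothetical unknown system f*(x, u) plus Gaussian process noise.
    return x + 0.1 * u + rng.gauss(0.0, 0.01)

def reward(x, u):
    # Sparse extrinsic reward: only paid near the goal at x = 1.
    return 1.0 if abs(x - 1.0) < 0.1 else 0.0

def fit_member(data, rng):
    """Fit x' ~ a*x + b*u by least squares on a bootstrap resample of the data."""
    sample = [rng.choice(data) for _ in data]
    sxx = sum(x * x for x, _, _ in sample)
    suu = sum(u * u for _, u, _ in sample)
    sxu = sum(x * u for x, u, _ in sample)
    sxy = sum(x * y for x, _, y in sample)
    suy = sum(u * y for _, u, y in sample)
    det = sxx * suu - sxu * sxu
    if det < 1e-9:  # no usable data yet: draw parameters from a broad prior
        return rng.uniform(0.5, 1.5), rng.uniform(-0.5, 0.5)
    return (sxy * suu - suy * sxu) / det, (suy * sxx - sxy * sxu) / det

def predict(ensemble, x, u):
    preds = [a * x + b * u for a, b in ensemble]
    return mean(preds), pstdev(preds)

def plan(ensemble, x0, lam, horizon, n_candidates, rng):
    """Random-shooting planner on the mean dynamics with reward r + lam * sigma."""
    best_seq, best_val = None, float("-inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        x, val = x0, 0.0
        for u in seq:
            mu, sigma = predict(ensemble, x, u)
            val += reward(x, u) + lam * sigma
            x = mu  # roll forward with the mean prediction only
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq

def run(episodes=5, horizon=10, lam=1.0, n_members=5, seed=0):
    rng = random.Random(seed)
    data, sigmas = [], []
    for _ in range(episodes):
        ensemble = [fit_member(data, rng) for _ in range(n_members)]
        sigmas.append(predict(ensemble, 1.0, 1.0)[1])  # disagreement at a probe point
        x = 0.0
        for u in plan(ensemble, x, lam, horizon, 64, rng):
            x_next = true_step(x, u, rng)
            data.append((x, u, x_next))
            x = x_next
    return data, sigmas
```

Calling `run()` returns the collected transitions and the ensemble disagreement at a fixed probe point per episode. In this toy setup the disagreement shrinks once real data replaces the prior draws, so the exploration bonus fades where the model has been identified.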

In the world-model instantiation, the training loop alternates between real-data collection, world-model updates, and actor–critic updates in imagination. Standard likelihood or reconstruction terms are computed on minibatches from a replay buffer; imagined rollouts are then generated from states encoded from real data; advantages are computed in imagination; and the optimistic dynamics loss is backpropagated into the world-model parameters. Policy and value networks are updated on the same imagined trajectories. For STORM, the recurrent latent state is replaced by a transformer-based encoding, but the optimistic loss is described as structurally identical (Mete et al., 10 Feb 2026).

The scalability claim is explicit in both formulations. The uncertainty-aware paper states that no expensive sampling from or optimization over the full posterior is ever required; only the posterior mean is used for planning, while $\sigma_n$ is treated as an intrinsic reward. It reports that, on state tasks, adding a 5-ensemble uncertainty bonus increases wall-time by […] over MBPO, and on visual tasks, training 5 small MLPs for the uncertainty estimate adds […] overhead to Dreamer; planning complexity is unchanged because only the reward function is modified (Sukhija et al., 25 Nov 2025).

The Optimistic World Models exposition makes a complementary scalability argument. Everything remains a simple gradient step; there are no inner-loop constrained solves, no high-dimensional uncertainty estimates, and no Gaussian processes. The optimistic loss reuses the same imagined trajectories that the policy update uses, requires zero changes to the base encoder, latent dynamics, decoder, actor, or critic architectures, and avoids […] or GP-style bottlenecks. The same text states that this makes SOMBRL immediately compatible with large transformer world models such as STORM, IRIS, and DIAMOND, as well as pixel-based world models such as DreamerV3. For the reported Atari100K MsPacman experiment on an RTX 4090, the marginal wall-clock figures are DreamerV3 baseline 178 min versus O-DreamerV3 115 min (Mete et al., 10 Feb 2026).

5. Theoretical guarantees and optimism properties

The uncertainty-aware SOMBRL paper develops regret guarantees under standard regularity assumptions: continuous dynamics and policies, bounded reward, Gaussian or sub-Gaussian noise, and an RKHS prior in which $f^*$ lies in the RKHS of the kernel with bounded norm. It defines the confidence set

[…]

with a scaling factor growing like […], expressed in terms of the maximum information gain. The paper’s optimism lemma states that, with appropriately chosen […] and

[…]

one has, with probability at least […], for all policies $\pi$,

[…]

From this, the paper derives sublinear regret in three settings: finite-horizon regret

[…]

discounted infinite-horizon regret

[…]

and non-episodic average-reward regret

[…]

The text identifies optimism plus information-gain control as the common crux of these results (Sukhija et al., 25 Nov 2025).

The Optimistic World Models exposition gives a different theoretical lineage through RBMLE. It states that one shows consistency and optimism when […] but […], and it interprets the RBMLE objective as the Lagrangian dual of a UCB-style constrained optimism problem. This places the gradient-based optimistic dynamics loss in direct conceptual continuity with classical reward-biased maximum likelihood estimation from adaptive control, rather than with explicit confidence-set optimization (Mete et al., 10 Feb 2026).

The earlier tractable optimism paper analyzes a Gaussian noise–augmented MDP and proves a competitive regret bound

[…]

Its analysis uses an optimism decomposition

[…]

showing that the first term is nonpositive under the Gaussian anti-concentration construction and bounding the second via concentration and Bellman-error arguments. This result is framed as making optimism tractable in a setting where exact optimistic planning would otherwise be intractable at scale (Pacchiano et al., 2020).
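A generic two-term split of this kind, written in assumed notation rather than the paper's, adds and subtracts the value $\tilde V_k$ computed in the optimistic model at episode $k$, with $V^*$ the optimal value, $V^{\pi_k}$ the true value of the executed policy, and $s_1$ the initial state:

```latex
V^{*}(s_1) - V^{\pi_k}(s_1)
  = \underbrace{V^{*}(s_1) - \tilde V_k(s_1)}_{\text{optimism term}}
  + \underbrace{\tilde V_k(s_1) - V^{\pi_k}(s_1)}_{\text{estimation term}} .
```

Whenever the optimistic value dominates the optimal one, the first term is nonpositive and the per-episode regret is controlled by the second term alone. This is why over-estimating by too much is costly: it inflates exactly the term that has to be bounded.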

6. Empirical behavior, ablations, and recurring failure modes

The empirical record reported for SOMBRL is concentrated on sparse-reward settings, visual-control benchmarks, and continuous-control tasks. In the uncertainty-aware 2025 formulation, state-based experiments on the DeepMind Control suite use MBPO as the base algorithm and compare against MBPO-Mean and MBPO-PETS. The reported result is that MBPO-Optimistic substantially outperforms both baselines, solves sparse tasks such as MountainCar and CartPole on which greedy methods fail, and speeds up Humanoid by approximately […] in sample efficiency. In visual control, Dreamer-Optimistic is reported to match or exceed standard Dreamer; on Venture, Dreamer fails entirely while Dreamer-Optimistic solves the task; and under added action-cost penalties, Dreamer collapses to zero action while the optimistic variant remains robust. On a real RC-car agile parking drift maneuver over 20 real-world episodes, both dense-reward versions succeed, but under a sparser margin only the SOMBRL variant learns reliably while the baseline sticks to trivial policies (Sukhija et al., 25 Nov 2025).

The Optimistic World Models results are reported on Atari100K and DeepMind Control Suite. On Atari100K, DreamerV3 achieves mean human-normalized score 97.45%, whereas O-DreamerV3 reaches 152.68%, an absolute gain of roughly 55 percentage points (about 57% relative to the baseline). The same report states that O-DreamerV3 doubles return in Private Eye and matches Montezuma’s Revenge with half the samples. For STORM, mean HNS increases from 75.90% to 80.68%, with improvements noted on sparse-reward games including Freeway, Up N Down, and Private Eye; Freeway is reported to move from 0 to +6.4. On DeepMind Control Suite, gains are emphasized in sparse tasks, including Acrobot Swingup Sparse from 8.4 to 34.6 and Cartpole Swingup Sparse from 664 to 747, while dense tasks are unchanged or slightly improved. The ablations report […] for both O-DreamerV3 and O-STORM, model entropy weights […] for DreamerV3 and […] for STORM, and comparable performance for decay schedules […] and […]. They also state that removing the entropy term slows learning in sparse settings, that overly large […] or […] degrades performance, and that sensitivity plots confirm a robust window […] (Mete et al., 10 Feb 2026).
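The two headline comparisons are easy to misread because a difference in percentage points and a relative improvement are different quantities. The short check below recomputes both from the mean human-normalized scores quoted above; the helper name `gains` is hypothetical.

```python
def gains(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative

dreamer_abs, dreamer_rel = gains(97.45, 152.68)  # DreamerV3 -> O-DreamerV3
storm_abs, storm_rel = gains(75.90, 80.68)       # STORM -> O-STORM
```

For the DreamerV3 pair this gives about 55.2 points absolute and about 56.7% relative; for STORM, about 4.8 points absolute and about 6.3% relative.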

The earlier tractable optimism study identifies a recurring failure mode that remains relevant to later SOMBRL discussions: naive optimism in deep ensembles can over-exploit bad models. The paper reports a strong positive correlation between each ensemble member’s validation-set prediction error and the fraction of times it is chosen by the optimism rule, describing the resulting behavior as the controller chasing spurious high-value samples. In its MBPO-style continuous-control experiments, SOMBRL with only […] models and a tight model-radius […] solves InvertedPendulum in approximately 1,850 steps versus MBPO’s approximately 2,200, while on Hopper and HalfCheetah it matches or modestly outperforms MBPO despite using 57% fewer models. By contrast, without a model-radius constraint, larger ensembles with […] catastrophically exploit bad models and learning stalls (Pacchiano et al., 2020).
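The failure mode and its remedy can be illustrated with a small selection rule. The sketch assumes that a "model radius" means a cap on how far a member's prediction may sit from the ensemble median before it becomes ineligible for optimistic selection. That reading, the function name `optimistic_member`, and the numbers are all hypothetical.

```python
from statistics import median

def optimistic_member(predicted_returns, predictions, radius):
    """Index of the member with the highest predicted return among members whose
    prediction lies within `radius` of the ensemble median."""
    center = median(predictions)
    admissible = [i for i, p in enumerate(predictions) if abs(p - center) <= radius]
    if not admissible:  # fall back to the member closest to the median
        admissible = [min(range(len(predictions)),
                          key=lambda i: abs(predictions[i] - center))]
    return max(admissible, key=lambda i: predicted_returns[i])

# Hypothetical ensemble: member 3 is an outlier that also promises a huge return.
predictions = [1.0, 1.1, 0.9, 5.0]
predicted_returns = [10.0, 12.0, 9.0, 100.0]

naive = optimistic_member(predicted_returns, predictions, float("inf"))  # picks 3
constrained = optimistic_member(predicted_returns, predictions, 0.5)     # picks 1
```

Without the radius the rule chases the outlier; with it, optimism is exercised only among members that roughly agree with the rest of the ensemble.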

Taken together, these results support a narrow but consistent interpretation of scalable optimism in model-based RL. Mild optimism can improve exploration materially in sparse-reward domains and can do so without changing the base planner or world-model architecture; however, the same literature also shows that optimism is not automatically beneficial. Excessive optimism, poorly calibrated uncertainty, or naive model selection can amplify estimation error rather than exploration. This suggests that the central technical issue in SOMBRL is not optimism alone, but the balance between optimism and model fidelity that each formulation tries to preserve.
