Scalable and Optimistic MBRL (SOMBRL)
- SOMBRL is an advanced model-based RL approach that injects optimism to improve exploration and sample efficiency in sparse-reward settings.
- It integrates three formulations—uncertainty-aware modeling, optimistic world models with RBMLE, and noise-augmented MDPs—to balance exploration and exploitation.
- Empirical results show that SOMBRL enhances convergence speed and performance on continuous-control and visual tasks while maintaining scalability with deep architectures.
Scalable and Optimistic Model-Based Reinforcement Learning (SOMBRL) is an approach to efficient exploration in model-based reinforcement learning under unknown dynamics, where the agent learns directly from online interactions and applies optimism in the face of uncertainty in a form intended to remain compatible with scalable planners, policy optimizers, and deep world-model architectures. In the literature considered here, SOMBRL appears in at least three closely related formulations: an uncertainty-aware dynamics-model framework with an intrinsic uncertainty bonus, a world-model instantiation based on Reward-Biased Maximum Likelihood Estimation (RBMLE) and an optimistic dynamics loss, and an earlier tractable optimism formulation based on a noise-augmented Markov decision process (Sukhija et al., 25 Nov 2025, Mete et al., 10 Feb 2026, Pacchiano et al., 2020).
1. Problem setting and motivation
Efficient exploration is a central challenge in reinforcement learning, particularly in sparse-reward environments. In model-based RL, the difficulty is acute because the agent must simultaneously identify unknown dynamics and exploit the current model for planning or policy optimization. One formulation writes the system as
$$s_{t+1} = f^*(s_t, a_t) + w_t,$$
with unknown dynamics $f^*$ and process noise $w_t$ that is treated as zero-mean Gaussian or sub-Gaussian in the theoretical development. This setting motivates optimism-based methods that prefer actions whose consequences are either high-reward or insufficiently known (Sukhija et al., 25 Nov 2025).
In large world-model systems, the same issue appears through certainty-equivalence training. Architectures such as Dreamer, STORM, and DIAMOND learn a dynamics model by maximum likelihood on real data and then learn a policy in imagination. Because the policy treats the current model estimate as if it were certain, such systems can get stuck in poorly explored regions; the exposition on Optimistic World Models identifies this as the closed-loop identification problem. Classical OFU and UCB approaches seek exploration through confidence sets, but in deep RL they require nonconvex constraints or explicit uncertainty estimation, which makes them difficult to scale directly (Mete et al., 10 Feb 2026).
A related earlier line of work reinterprets scalable optimistic model-based algorithms as solving a tractable noise-augmented MDP. That perspective emphasizes the basic optimism–estimation trade-off: an optimistic RL algorithm must over-estimate the true value function, but not by so much that estimation error dominates. The same paper reports that, in deep RL, estimation error is significantly more troublesome, even though optimistic model-based methods can match state-of-the-art continuous-control performance when that error is controlled (Pacchiano et al., 2020).
2. Major formulations
The SOMBRL label is applied to distinct but technically connected mechanisms for scalable optimism. The following comparison organizes the principal formulations that appear in the cited literature.
| Formulation | Core mechanism | Representative compatibility |
|---|---|---|
| Uncertainty-aware SOMBRL | Posterior mean $\mu_n$ and uncertainty $\sigma_n$; optimize extrinsic reward plus a $\lambda_n \sigma_n$ bonus under the mean dynamics | MBPO, Dreamer, SimFSVGD |
| Optimistic World Models | Augment world-model training with an optimistic dynamics loss based on RBMLE | DreamerV3, STORM |
| Noise-augmented optimism | Add sampled noise to empirical rewards or transitions to form an optimistic MDP | UCBVI-style planning; MBPO-style deep variants |
In the uncertainty-aware formulation, SOMBRL fits either a Gaussian process or a small ensemble of neural networks to transition data and records, for each state–action pair $(s, a)$, a mean $\mu_n(s, a)$ and standard deviation $\sigma_n(s, a)$. Planning then uses the learned mean dynamics while augmenting extrinsic reward with an intrinsic term proportional to epistemic uncertainty. The paper emphasizes that this preserves the ordinary model-based planning structure: one solves exactly the same planning problem as greedy model-based RL, but with an augmented reward (Sukhija et al., 25 Nov 2025).
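As a rough illustration of the augmented reward described above, the following sketch computes a per-dimension ensemble mean and standard deviation and adds a scaled uncertainty bonus to the extrinsic reward. The function names, the use of ensemble disagreement as the uncertainty estimate, the Euclidean norm of the standard-deviation vector, and the value of `lam` are assumptions made for illustration, not details taken from the paper.

```python
import math
from statistics import fmean, pstdev


def ensemble_stats(predictions):
    """Return per-dimension mean and std over ensemble members.

    `predictions` holds one next-state prediction per ensemble member,
    each a list of floats of equal length.
    """
    dims = list(zip(*predictions))  # group predictions by state dimension
    mean = [fmean(d) for d in dims]
    std = [pstdev(d) for d in dims]  # disagreement as an epistemic proxy
    return mean, std


def optimistic_reward(extrinsic, std, lam):
    """Extrinsic reward plus lam times the norm of the ensemble std."""
    bonus = math.sqrt(sum(s * s for s in std))
    return extrinsic + lam * bonus


# Three hypothetical ensemble members predicting a 2-D next state.
preds = [[1.0, 0.0], [1.0, 0.3], [1.0, 0.6]]
mu, sigma = ensemble_stats(preds)
r_opt = optimistic_reward(extrinsic=0.0, std=sigma, lam=2.0)
```

The members agree on the first dimension and disagree on the second, so only the second contributes to the bonus. The planner itself is untouched: it would receive `mu` as the dynamics prediction and `r_opt` as the reward.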
In the Optimistic World Models formulation, optimism is not injected through an explicit uncertainty bonus. Instead, the model-learning objective itself is biased toward high-reward imagined transitions by adding an optimistic dynamics loss. This is described as a plug-and-play instance of scalable optimistic MBRL within existing world-model architectures, preserving scalability while requiring only minimal modifications to standard training procedures. The paper instantiates this design as Optimistic DreamerV3 and Optimistic STORM (Mete et al., 10 Feb 2026).
The noise-augmented formulation constructs an optimistic MDP by perturbing empirical reward and transition estimates with random noise, then selecting the best sample. In the tabular development, the optimistic reward and transition are formed from a small “model-cloud” of noisy samples. In deep RL, the same work studies ensemble-based optimism and shows that naive “pick the model predicting the highest reward/return” can lead to exploitation of inaccurate models unless additional controls are imposed (Pacchiano et al., 2020).
3. Mathematical objectives
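A central objective in uncertainty-aware SOMBRL is the optimistic planning criterion
$$\pi_n \in \arg\max_{\pi} \; J_n(\pi),$$
with shorthand
$$J_n(\pi) = \mathbb{E}\left[\sum_{t=0}^{T-1} r(s_t, a_t) + \lambda_n \lVert \sigma_n(s_t, a_t) \rVert\right], \qquad s_{t+1} = \mu_n(s_t, a_t) + w_t, \quad a_t = \pi(s_t).$$
Here $\lambda_n$ trades off exploration and exploitation, and the expectation is taken under policy $\pi$ and the mean dynamics plus known noise. In the Gaussian-process case, the posterior mean and variance are written componentwise as
$$\mu_{n,j}(z) = k_n(z)^\top (K_n + \sigma_w^2 I)^{-1} y_{n,j}, \qquad \sigma_{n,j}^2(z) = k(z, z) - k_n(z)^\top (K_n + \sigma_w^2 I)^{-1} k_n(z),$$
where $z = (s, a)$, $K_n$ is the kernel matrix of the observed inputs, $k_n(z)$ is the vector of kernel evaluations between $z$ and those inputs, $y_{n,j}$ collects the observed targets for output dimension $j$, and $\sigma_w^2$ is the noise variance; these induce a confidence set of plausible dynamics $\mathcal{M}_n$ (Sukhija et al., 25 Nov 2025).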
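In the world-model formulation, let $\theta$ parameterize dynamics $p_\theta$, let $\phi$ parameterize the policy $\pi_\phi$, and let
$$\mathcal{L}_{\text{MLE}}(\theta) = -\mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[\log p_\theta(s' \mid s, a)\right]$$
denote the standard likelihood component on real transitions. SOMBRL augments this with an optimistic dynamics loss computed on imagined trajectories:
$$\mathcal{L}_{\text{opt}}(\theta) = -\mathbb{E}_{\hat{\tau} \sim p_\theta, \pi_\phi}\left[\sum_{t} \hat{A}_t \, \log p_\theta(\hat{s}_{t+1} \mid \hat{s}_t, \hat{a}_t)\right],$$
where
$$\hat{A}_t = \hat{R}_t - V(\hat{s}_t)$$
is the advantage of the imagined return $\hat{R}_t$ over the critic's value estimate, treated as a constant weight. The total world-model loss is
$$\mathcal{L}_{\text{WM}}(\theta) = \mathcal{L}_{\text{MLE}}(\theta) + \alpha \, \mathcal{L}_{\text{opt}}(\theta).$$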
The exposition describes this term as pushing the log-probabilities of imagined transitions toward outcomes that yield high imagined return, in a fully gradient-based, uncertainty-free form (Mete et al., 10 Feb 2026).
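To make the shape of this objective concrete, the sketch below combines a likelihood term on real transitions with an advantage-weighted log-probability term on imagined transitions. It assumes the optimistic term is a plain advantage-weighted negative log-likelihood with advantages held constant, and it works on precomputed scalar log-probabilities rather than a real world model; the function names and the weight `alpha` are hypothetical.

```python
def mle_loss(real_log_probs):
    """Negative mean log-likelihood of real transitions under the model."""
    return -sum(real_log_probs) / len(real_log_probs)


def optimistic_dynamics_loss(imagined_log_probs, advantages):
    """Negative advantage-weighted mean log-probability of imagined transitions.

    Advantages act as fixed weights, so minimizing this term raises the
    log-probability of transitions with positive advantage and lowers it
    for transitions with negative advantage.
    """
    weighted = [a * lp for lp, a in zip(imagined_log_probs, advantages)]
    return -sum(weighted) / len(weighted)


def world_model_loss(real_log_probs, imagined_log_probs, advantages, alpha):
    """Likelihood term plus alpha times the optimistic dynamics term."""
    return mle_loss(real_log_probs) + alpha * optimistic_dynamics_loss(
        imagined_log_probs, advantages
    )


# Hypothetical numbers: two real and two imagined transitions.
total = world_model_loss(
    real_log_probs=[-1.0, -2.0],
    imagined_log_probs=[-0.5, -1.5],
    advantages=[1.0, -1.0],
    alpha=0.1,
)
```

In an actual implementation the log-probabilities would be differentiable outputs of the world model and the advantages would be detached from the gradient; here they are plain floats so the arithmetic can be checked by hand.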
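The same world-model exposition also places this loss within RBMLE. Standard MLE uses
$$\hat{\theta}_t \in \arg\max_{\theta} \; \log \ell_t(\theta),$$
whereas RBMLE inserts reward bias through
$$\hat{\theta}_t \in \arg\max_{\theta} \; \left\{ \log \ell_t(\theta) + \alpha(t) \, J^*(\theta) \right\},$$
where $\ell_t(\theta)$ is the likelihood of the data observed up to time $t$, $J^*(\theta)$ is the optimal return under model $\theta$, and $\alpha(t)$ is the reward-bias weight. The text further states that UCB can be viewed as a constrained problem and RBMLE as the Lagrangian dual,
$$\max_{\theta} \; J^*(\theta) \;\; \text{s.t.} \;\; \log \ell_t(\theta) \ge c_t \qquad \text{and} \qquad \max_{\theta} \; \left\{ J^*(\theta) + \nu \log \ell_t(\theta) \right\},$$
respectively (Mete et al., 10 Feb 2026).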
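A toy case may help show what reward bias does to an estimate. The sketch below applies the reward-biased objective to a single Bernoulli reward with success probability `theta`, where the optimal one-step return is `theta` itself, and finds the maximizer by grid search. The Bernoulli setting, the grid search, and the names `rbmle_bernoulli` and `alpha` are illustrative assumptions and are not drawn from the cited work.

```python
import math


def rbmle_bernoulli(successes, trials, alpha, grid_size=999):
    """Grid-search argmax of log-likelihood(theta) + alpha * theta.

    With alpha = 0 this is the ordinary maximum likelihood estimate;
    a positive alpha biases the estimate toward higher reward.
    """
    best_theta, best_score = None, -math.inf
    failures = trials - successes
    for i in range(1, grid_size + 1):
        theta = i / (grid_size + 1)  # stays strictly inside (0, 1)
        log_lik = successes * math.log(theta) + failures * math.log(1 - theta)
        score = log_lik + alpha * theta
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta


mle = rbmle_bernoulli(successes=3, trials=10, alpha=0.0)
biased = rbmle_bernoulli(successes=3, trials=10, alpha=5.0)
```

With 3 successes in 10 trials the unbiased estimate is 0.3, while a bias weight of 5 moves the estimate to roughly 0.42. As more data arrive the likelihood term grows and the same bias weight shifts the estimate less.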
In the noise-augmented MDP formulation, optimism is produced by sampling noisy rewards or transitions around empirical estimates. For rewards,
$$\tilde{r}^{(m)}(s, a) = \hat{r}(s, a) + \xi^{(m)}(s, a), \qquad m = 1, \dots, M,$$
and for transitions,
$$\tilde{P}^{(m)}(\cdot \mid s, a) = \hat{P}(\cdot \mid s, a) + \zeta^{(m)}(s, a),$$
with Gaussian noise $\xi^{(m)}$ and $\zeta^{(m)}$ in the analyzed variant. These are collapsed into a single optimistic reward and transition by maximizing across samples, thereby turning a single empirical MDP into a small “model-cloud” whose best member supplies the optimism (Pacchiano et al., 2020).
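The reward side of this construction can be sketched in a few lines: draw several Gaussian perturbations of an empirical reward and keep the largest. The noise scale used here, shrinking with the square root of a visit count, is an assumed choice for illustration, and the function name is hypothetical; transitions are omitted because perturbing a probability vector needs extra care to keep it valid.

```python
import random


def optimistic_reward_from_cloud(r_hat, noise_std, num_samples, rng):
    """Sample noisy copies of an empirical reward and keep the largest."""
    samples = [r_hat + rng.gauss(0.0, noise_std) for _ in range(num_samples)]
    return max(samples)


rng = random.Random(0)
visit_count = 4
noise_std = 1.0 / visit_count ** 0.5  # assumed scaling: less noise with more data
r_tilde = optimistic_reward_from_cloud(0.5, noise_std, num_samples=10, rng=rng)
```

Taking the maximum over more samples can only raise the result for a fixed random stream, which is how the size of the cloud controls the amount of optimism.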
4. Algorithmic structure and scalability
The uncertainty-aware SOMBRL workflow is episode-based. At each episode $n$, the method fits or updates an uncertainty-aware model on collected transitions, computes posterior mean $\mu_n$ and standard deviation $\sigma_n$, sets an exploration weight $\lambda_n$, defines the optimistic reward
$$\tilde{r}_n(s, a) = r(s, a) + \lambda_n \lVert \sigma_n(s, a) \rVert,$$
plans or trains a policy $\pi_n$ to maximize the return of $\tilde{r}_n$ under the learned mean dynamics, and then executes $\pi_n$ on the real system to collect more data. The paper states that any model-based planner or model-predictive control / policy-optimizer applies, and explicitly names MBPO, Dreamer, and SimFSVGD as compatible examples (Sukhija et al., 25 Nov 2025).
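The episode loop above can be sketched with the model, planner, and environment passed in as callables, which mirrors the claim that only the reward changes. Everything below the loop is a toy stand-in invented for this example: a 1-D system, a constant uncertainty that shrinks with the amount of data, and a one-step greedy planner that ignores the learned mean for brevity.

```python
def sombrl_loop(fit_model, plan_policy, collect_episode, reward_fn,
                lam_schedule, num_episodes):
    """Episode loop: fit model, build optimistic reward, plan, collect data."""
    data, policies = [], []
    for n in range(num_episodes):
        mean_fn, std_fn = fit_model(data)  # uncertainty-aware model
        lam = lam_schedule(n)              # exploration weight

        def optimistic_reward(s, a, lam=lam, std_fn=std_fn):
            return reward_fn(s, a) + lam * std_fn(s, a)

        policy = plan_policy(mean_fn, optimistic_reward)  # any planner
        data.extend(collect_episode(policy))              # real rollout
        policies.append(policy)
    return data, policies


# --- Toy stand-ins (all hypothetical) ---
def toy_fit_model(data):
    n = len(data)
    return (lambda s, a: s + a), (lambda s, a: 1.0 / (1 + n))


def toy_plan_policy(mean_fn, reward):
    actions = (-1.0, 0.0, 1.0)  # one-step greedy choice over three actions
    return lambda s: max(actions, key=lambda a: reward(s, a))


def toy_collect_episode(policy, horizon=3):
    s, transitions = 0.0, []
    for _ in range(horizon):
        a = policy(s)
        transitions.append((s, a, s + a))
        s = s + a
    return transitions


def toy_reward(s, a):
    return -abs(s + a - 2.0)  # prefers moving toward and staying at state 2


data, policies = sombrl_loop(toy_fit_model, toy_plan_policy, toy_collect_episode,
                             toy_reward, lambda n: 1.0 / (n + 1), num_episodes=2)
```

Swapping in a different planner or model only requires replacing the corresponding callable; the loop body stays the same.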
In the world-model instantiation, the training loop alternates between real-data collection, world-model updates, and actor–critic updates in imagination. Standard likelihood or reconstruction terms are computed on minibatches from a replay buffer; imagined rollouts are then generated from states encoded from real data; advantages $\hat{A}_t$ are computed in imagination; and the optimistic dynamics loss is backpropagated into the world-model parameters $\theta$. Policy and value networks are updated on the same imagined trajectories. For STORM, the recurrent latent state is replaced by a transformer-based encoding, but the optimistic loss is described as structurally identical (Mete et al., 10 Feb 2026).
The scalability claim is explicit in both formulations. The uncertainty-aware paper states that no expensive sampling from or optimization over the full posterior is ever required; only the posterior mean is used for planning, while $\sigma_n$ is treated as an intrinsic reward. It reports the added wall-time of a 5-ensemble uncertainty bonus over MBPO on state tasks and of training 5 small MLPs to estimate $\sigma_n$ on top of Dreamer on visual tasks; planning complexity is unchanged because only the reward function is modified (Sukhija et al., 25 Nov 2025).
The Optimistic World Models exposition makes a complementary scalability argument. Everything remains a simple gradient step; there are no inner-loop constrained solves, no high-dimensional uncertainty estimates, and no Gaussian processes. The optimistic loss reuses the same imagined trajectories that the policy update uses, requires zero changes to the base encoder, latent dynamics, decoder, actor, or critic architectures, and avoids GP-style bottlenecks. The same text states that this makes SOMBRL immediately compatible with large transformer world models such as STORM, IRIS, and DIAMOND, as well as pixel-based world models such as DreamerV3. For the reported Atari100K MsPacman experiment on an RTX 4090, the wall-clock figures are DreamerV3 baseline 178 min versus O-DreamerV3 115 min (Mete et al., 10 Feb 2026).
5. Theoretical guarantees and optimism properties
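The uncertainty-aware SOMBRL paper develops regret guarantees under standard regularity assumptions: continuous dynamics and policies, bounded reward, Gaussian or sub-Gaussian noise, and an RKHS prior in which $f^*$ lies in the RKHS of kernel $k$ with bounded norm $B$. It defines the confidence set
$$\mathcal{M}_n = \left\{ f : \lvert f_j(z) - \mu_{n,j}(z) \rvert \le \beta_n \, \sigma_{n,j}(z) \;\; \text{for all } z \text{ and } j \right\},$$
with $\beta_n$ growing with $\gamma_n$, where $\gamma_n$ is the maximum information gain. The paper’s optimism lemma states that, with appropriately chosen $\lambda_n$, one has, with probability at least $1 - \delta$, for all policies $\pi$,
$$J(\pi, f^*) \le J_n(\pi),$$
where $J(\pi, f^*)$ is the return of $\pi$ under the true dynamics and $J_n(\pi)$ is the optimistic objective with reward $r + \lambda_n \lVert \sigma_n \rVert$ under the mean dynamics $\mu_n$.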
From this, the paper derives sublinear regret in three settings: finite-horizon regret, discounted infinite-horizon regret, and non-episodic average-reward regret. The text identifies optimism plus information-gain control as the common crux of these results (Sukhija et al., 25 Nov 2025).
The Optimistic World Models exposition gives a different theoretical lineage through RBMLE. It states that one shows consistency and optimism when the reward-bias weight satisfies $\alpha(t) \to \infty$ but $\alpha(t)/t \to 0$, and it interprets the RBMLE objective as the Lagrangian dual of a UCB-style constrained optimism problem. This places the gradient-based optimistic dynamics loss in direct conceptual continuity with classical reward-biased maximum likelihood estimation from adaptive control, rather than with explicit confidence-set optimization (Mete et al., 10 Feb 2026).
The earlier tractable optimism paper analyzes a Gaussian noise-augmented MDP and proves a competitive regret bound. Its analysis uses an optimism decomposition
$$V^*(s_1) - V^{\pi_k}(s_1) = \left[ V^*(s_1) - \tilde{V}_k(s_1) \right] + \left[ \tilde{V}_k(s_1) - V^{\pi_k}(s_1) \right],$$
where $\tilde{V}_k$ is the value function of the noise-augmented MDP in episode $k$, showing that the first term is nonpositive under the Gaussian anti-concentration construction and bounding the second via concentration and Bellman-error arguments. This result is framed as making optimism tractable in a setting where exact optimistic planning would otherwise be intractable at scale (Pacchiano et al., 2020).
6. Empirical behavior, ablations, and recurring failure modes
The empirical record reported for SOMBRL is concentrated on sparse-reward settings, visual-control benchmarks, and continuous-control tasks. In the uncertainty-aware 2025 formulation, state-based experiments on the DeepMind Control suite use MBPO as the base algorithm and compare against MBPO-Mean and MBPO-PETS. The reported result is that MBPO-Optimistic substantially outperforms both baselines, solves sparse tasks such as MountainCar and CartPole that greedy methods fail to solve, and improves sample efficiency on Humanoid. In visual control, Dreamer-Optimistic is reported to match or exceed standard Dreamer; on Venture, Dreamer fails entirely while Dreamer-Optimistic solves the task; and under added action-cost penalties, Dreamer collapses to zero action while the optimistic variant remains robust. On a real RC-car agile parking drift maneuver over 20 real-world episodes, both dense-reward versions succeed, but under a sparser margin only the SOMBRL variant learns reliably while the baseline sticks to trivial policies (Sukhija et al., 25 Nov 2025).
The Optimistic World Models results are reported on Atari100K and DeepMind Control Suite. On Atari100K, DreamerV3 achieves mean human-normalized score 97.45%, whereas O-DreamerV3 reaches 152.68%, described as an approximately 55% gain; this corresponds to about 55 percentage points in absolute terms, or roughly 57% relative to the baseline. The same report states that O-DreamerV3 doubles return in Private Eye and matches Montezuma’s with half the samples. For STORM, mean HNS increases from 75.90% to 80.68%, with improvements noted on sparse-reward games including Freeway, Up N Down, and Private Eye; Freeway is reported to move from 0 to +6.4. On DeepMind Control Suite, gains are emphasized in sparse tasks, including Acrobot Swingup Sparse from 8.4 to 34.6 and Cartpole Swingup Sparse from 664 to 747, while dense tasks are unchanged or slightly improved. The ablations report a shared optimism weight for both O-DreamerV3 and O-STORM, separate model entropy weights for DreamerV3 and STORM, and comparable performance across the decay schedules tested. They also state that removing the entropy term slows learning in sparse settings, that overly large optimism or entropy weights degrade performance, and that sensitivity plots confirm a robust window of weights (Mete et al., 10 Feb 2026).
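Because percentage scores invite confusion between absolute and relative gains, the following short calculation works both out from the mean human-normalized scores quoted above. The helper name `gains` is introduced here for illustration.

```python
def gains(baseline, improved):
    """Return (absolute gain in points, relative gain as a fraction)."""
    absolute = improved - baseline
    return absolute, absolute / baseline


dreamer_abs, dreamer_rel = gains(97.45, 152.68)
storm_abs, storm_rel = gains(75.90, 80.68)

print(f"DreamerV3 -> O-DreamerV3: +{dreamer_abs:.2f} points, {dreamer_rel:.1%} relative")
print(f"STORM -> O-STORM: +{storm_abs:.2f} points, {storm_rel:.1%} relative")
# DreamerV3 -> O-DreamerV3: +55.23 points, 56.7% relative
# STORM -> O-STORM: +4.78 points, 6.3% relative
```

The DreamerV3 improvement is therefore about 55 points on the human-normalized scale and about 57% of the baseline score, while the STORM improvement is under 5 points and about 6% of its baseline.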
The earlier tractable optimism study identifies a recurring failure mode that remains relevant to later SOMBRL discussions: naive optimism in deep ensembles can over-exploit bad models. The paper reports a strong positive correlation between each ensemble member’s validation-set prediction error and the fraction of times it is chosen by the optimism rule, describing the resulting behavior as the controller chasing spurious high-value samples. In its MBPO-style continuous-control experiments, SOMBRL with a small ensemble and a tight model radius solves InvertedPendulum in approximately 1,850 steps versus MBPO’s approximately 2,200, while on Hopper and HalfCheetah it matches or modestly outperforms MBPO despite using 57% fewer models. By contrast, without a model-radius constraint, larger ensembles catastrophically exploit bad models and learning stalls (Pacchiano et al., 2020).
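The failure mode described above can be reproduced in miniature. The sketch below picks the ensemble member with the highest predicted return, optionally restricted to members whose validation error is within a threshold. Treating the model radius as a validation-error threshold is an assumption made for this example and may differ from the paper's definition; the function name and all numbers are hypothetical.

```python
def select_optimistic_model(predicted_returns, validation_errors, radius=None):
    """Index of the highest-return model among those within the error radius.

    With radius=None every model is eligible, which is the naive rule.
    """
    candidates = [
        i for i, err in enumerate(validation_errors)
        if radius is None or err <= radius
    ]
    if not candidates:
        # No model is accurate enough: fall back to the most accurate one.
        return min(range(len(validation_errors)),
                   key=validation_errors.__getitem__)
    return max(candidates, key=predicted_returns.__getitem__)


# Hypothetical ensemble: the third model is inaccurate and wildly optimistic.
returns = [10.0, 12.0, 50.0]
errors = [0.1, 0.2, 3.0]

naive_choice = select_optimistic_model(returns, errors)
constrained_choice = select_optimistic_model(returns, errors, radius=0.5)
```

The naive rule selects the inaccurate third model because its predicted return is highest, whereas the constrained rule selects the second model, which is the most optimistic among the accurate ones.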
Taken together, these results support a narrow but consistent interpretation of scalable optimism in model-based RL. Mild optimism can improve exploration materially in sparse-reward domains and can do so without changing the base planner or world-model architecture; however, the same literature also shows that optimism is not automatically beneficial. Excessive optimism, poorly calibrated uncertainty, or naive model selection can amplify estimation error rather than exploration. This suggests that the central technical issue in SOMBRL is not optimism alone, but the balance between optimism and model fidelity that each formulation tries to preserve.