
Latent Offline Model-Based Policy Optimization

Updated 22 January 2026
  • LOMPO is a framework for offline RL that integrates latent-space dynamics, uncertainty quantification, and conservative policy optimization to handle complex visual inputs.
  • It employs low-dimensional latent representations and ensemble uncertainty estimation to mitigate overestimation and address challenges from static offline datasets.
  • Empirical results demonstrate that LOMPO outperforms baselines on tasks such as Walker, D’Claw, and real robot scenarios by leveraging latent-space pessimism.

Latent Offline Model-Based Policy Optimization (LOMPO) is a framework for offline reinforcement learning (RL) from high-dimensional visual observations, integrating latent-space dynamics modeling, uncertainty quantification, and conservative policy optimization. Developed in the context of partially observable Markov decision processes (POMDPs), LOMPO addresses the challenges inherent to static datasets and complex observation spaces in offline RL environments.

1. Problem Setting and Formalization

LOMPO operates in a POMDP defined by:

  • Observation space $\mathcal{X}$ (e.g., images)
  • Latent state space $\mathcal{S}$
  • Action space $\mathcal{A}$
  • Transition kernel $T(s'|s,a)$
  • Observation kernel $D(o|s)$
  • Reward function $r(s,a)$
  • Initial state distribution $\mu_0(s)$ and discount factor $\gamma$

The agent is provided only with a static dataset:

$$D = \{(o_1, a_1, r_1, o_2),\ \dotsc,\ (o_{T-1}, a_{T-1}, r_{T-1}, o_T)\}$$

collected under some unknown behavior policy, with no further environment interaction allowed. The objective is to learn a policy $\pi(a|o)$ (or, in latent space, $\pi(a|z)$) that maximizes the expected discounted return:

$$\eta(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

where $s_t$ is unobserved and $o_t \sim D(\cdot|s_t)$ (Rafailov et al., 2020).
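As a concrete check of the objective above, the discounted return of a finite trajectory can be computed directly. This is a minimal illustration, not code from the paper; `discounted_return` is a hypothetical helper name.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of the discounted return sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))  # [1, gamma, gamma^2, ...]
    return float(np.sum(discounts * rewards))

# Toy trajectory: 1.0 + 0.5*2.0 + 0.25*3.0 = 2.75 with gamma = 0.5
print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))  # → 2.75
```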

2. Latent-Space Dynamics Modeling and Variational Training

LOMPO introduces a low-dimensional latent variable $z_t$ to summarize state. The model is composed of:

  • A generative model $p_\theta$:
    • $p_\theta(z_1) = p(z_1)$ (prior)
    • $p_\theta(z_t|z_{t-1}, a_{t-1})$ for $t \geq 2$ (latent transition)
    • $p_\theta(o_t|z_t)$ (observation decoder)
  • An inference network $q_\phi$:
    • $q_\phi(z_1|o_1)$
    • $q_\phi(z_t|z_{t-1}, a_{t-1}, o_t)$ for $t \geq 2$

The joint likelihood is:

$$p_\theta(o_{1:T}, z_{1:T} \mid a_{1:T-1}) = p(z_1)\prod_{t=2}^T p_\theta(z_t|z_{t-1}, a_{t-1})\,\prod_{t=1}^T p_\theta(o_t|z_t)$$

Training is performed by maximizing the variational evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{z_{1:T} \sim q_\phi}\left[ \sum_{t=1}^T \log p_\theta(o_t|z_t) + \sum_{t=2}^T \log p_\theta(z_t|z_{t-1}, a_{t-1}) - \sum_{t=2}^T \log q_\phi(z_t|z_{t-1}, a_{t-1}, o_t) - \log q_\phi(z_1|o_1) + \log p(z_1) \right]$$

In practice, this is evaluated as

$$\mathcal{L}(\theta, \phi) \simeq \sum_{t=1}^T \mathbb{E}_{q_\phi}[\log p_\theta(o_t|z_t)] - \sum_{t=2}^T \mathbb{E}_{q_\phi}\left[\mathrm{KL}\big(q_\phi(z_t|z_{t-1}, a_{t-1}, o_t)\,\Vert\, p_\theta(z_t|z_{t-1}, a_{t-1})\big)\right] - \mathrm{KL}\big(q_\phi(z_1|o_1)\,\Vert\, p(z_1)\big)$$

(Rafailov et al., 2020).
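When both the posterior and the prior are diagonal Gaussians, as is typical in latent dynamics models, the per-step KL terms in the practical form above have a closed-form expression. A minimal sketch under that assumption (`diag_gauss_kl` is an illustrative helper, not the paper's implementation):

```python
import numpy as np

def diag_gauss_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    Gaussians, summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return float(np.sum(kl))

# KL between identical distributions is zero:
mu, logvar = np.zeros(3), np.zeros(3)
print(diag_gauss_kl(mu, logvar, mu, logvar))  # → 0.0
```

Summing these per-step KLs and subtracting them from the reconstruction term gives the ELBO estimate used for training $(\theta, \phi)$.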

3. Latent-Space Uncertainty Quantification

To mitigate compounding model errors and value overestimation on out-of-distribution (OOD) samples, LOMPO employs an ensemble of $K$ latent transition models $\{T_{\theta_1}, \dotsc, T_{\theta_K}\}$.

Uncertainty $u(z,a)$ can be estimated as either:

  • The variance of ensemble means:

$$u(z,a) = \frac{1}{K}\sum_{k=1}^K \big\|\mu_k(z,a) - \bar\mu(z,a)\big\|^2, \quad \bar\mu = \frac{1}{K} \sum_k \mu_k$$

  • The variance of log-likelihoods:

$$u(z,a) = \operatorname{Var}_{k=1..K}\big[\log T_{\theta_k}(z'|z,a)\big]$$

These estimates provide a bound on one-step model error under certain conditions (Rafailov et al., 2020).
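The variance-of-means estimator above can be sketched in a few lines (illustrative only; `ensemble_disagreement` is a hypothetical helper operating on the ensemble members' predicted means):

```python
import numpy as np

def ensemble_disagreement(mus):
    """Uncertainty u(z,a) as the mean squared deviation of ensemble mean
    predictions from their average (variance-of-means estimator).

    mus: array of shape (K, latent_dim), one predicted mean per member.
    """
    mus = np.asarray(mus, dtype=float)
    mu_bar = mus.mean(axis=0)  # average prediction over the K members
    return float(np.mean(np.sum((mus - mu_bar) ** 2, axis=1)))

# Members that agree exactly yield zero uncertainty;
# spreading their predictions apart increases it.
print(ensemble_disagreement([[0.0, 0.0], [0.0, 0.0]]))   # → 0.0
print(ensemble_disagreement([[1.0, 0.0], [-1.0, 0.0]]))  # → 1.0
```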

4. Policy Optimization via Pessimism in Latent Space

LOMPO constructs a penalized latent MDP:

$$\widetilde{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \hat{T}, \tilde{r}, \mu_0, \gamma)$$

with pessimistic rewards

$$\tilde{r}(z,a) = \mathbb{E}_{z' \sim \hat{T}(\cdot|z,a)}[\hat{r}(z,a)] - \lambda u(z,a)$$

where $\lambda$ controls the degree of conservatism. The actor $\pi_\psi(a|z)$ and critic $Q_\psi(z,a)$ are both parameterized and optimized using an off-policy algorithm in latent space (e.g., Soft Actor-Critic-style updates):

  • Roll out trajectories in latent space under the ensemble model, starting from $z_1 \sim q_\phi(\cdot|o_1)$.
  • Form a model buffer of transitions $(z_t, a_t, \tilde{r}_t, z_{t+1})$.
  • Mix these with "real" transitions $(z_t, a_t, r_t, z_{t+1})$ inferred via $q_\phi$.
  • Update policy and critic parameters via the corresponding loss functions:

$$\mathcal{L}_Q = \mathbb{E}_{(z,a,\tilde{r},z')}\left[\left(Q_\psi(z,a) - \left[\tilde{r} + \gamma\,\mathbb{E}_{a'\sim\pi_\psi} Q_\psi(z',a')\right]\right)^2\right]$$

$$\mathcal{L}_\pi = \mathbb{E}_{z}\left[\mathbb{E}_{a\sim\pi_\psi}\left[\alpha \log\pi_\psi(a|z) - Q_\psi(z,a)\right]\right]$$

The $\lambda u(z,a)$ penalty enforces conservatism, discouraging the policy from selecting actions in regions where the latent model is uncertain (Rafailov et al., 2020).
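Putting the reward penalty together with the disagreement estimator, a batch of penalized rewards might be computed as follows. This is a sketch under the assumption of Gaussian ensemble heads; all names are illustrative, not the paper's code.

```python
import numpy as np

def pessimistic_rewards(r_hat, mus, lam=1.0):
    """Penalized reward r~(z,a) = r_hat(z,a) - lam * u(z,a), with u the
    ensemble variance-of-means disagreement.

    r_hat: (B,) predicted rewards for a batch of latent transitions.
    mus:   (K, B, D) predicted next-latent means, one per ensemble member.
    lam:   conservatism coefficient (lambda in the text).
    """
    mus = np.asarray(mus, dtype=float)
    mu_bar = mus.mean(axis=0, keepdims=True)                    # (1, B, D)
    u = np.mean(np.sum((mus - mu_bar) ** 2, axis=-1), axis=0)   # (B,)
    return np.asarray(r_hat, dtype=float) - lam * u

# A transition where the two members disagree gets its reward pushed down:
r = pessimistic_rewards([1.0], [[[1.0, 0.0]], [[-1.0, 0.0]]], lam=0.5)
print(r)  # → [0.5]
```

Increasing `lam` shrinks the effective reward in uncertain regions, which is exactly the mechanism that steers the policy back toward the data distribution.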

5. Algorithmic Summary

A high-level workflow for LOMPO:

| Step | Description | Output |
|------|-------------|--------|
| 1 | Process dataset through $q_\phi$ to obtain $z_t$ | "Real" replay buffer $\mathbb{B}_\text{real}$ |
| 2 | Train $(\theta, \phi)$ by maximizing the ELBO | Trained latent model and encoder |
| 3 | Initialize policy, critic, and model buffer $\mathbb{B}_\text{model}$ | Ready for policy optimization |
| 4 | Generate imagined rollouts in latent space using the ensemble; compute $\tilde{r}_t$; store in $\mathbb{B}_\text{model}$ | Model-generated transitions |
| 5 | Alternate updates: sample mixed minibatches from $\mathbb{B}_\text{real} \cup \mathbb{B}_\text{model}$ and take gradient steps on the policy and critic losses | Updated actor-critic |

All computation is performed without further environment interaction, making the procedure compatible with the fully offline setting.
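Step 5's mixed sampling can be sketched as follows (illustrative only; `real_fraction` is a hypothetical mixing hyperparameter, and the paper may use a different ratio or sampling scheme):

```python
import random

def sample_mixed_batch(real_buffer, model_buffer, batch_size, real_fraction=0.5):
    """Sample a minibatch mixing real (encoded) and model-generated
    transitions, as in step 5 of the workflow above."""
    n_real = int(round(batch_size * real_fraction))
    n_model = batch_size - n_real
    batch = (random.choices(real_buffer, k=n_real)
             + random.choices(model_buffer, k=n_model))
    random.shuffle(batch)  # avoid ordering real before model transitions
    return batch

real = [("real", i) for i in range(10)]
model = [("model", i) for i in range(10)]
batch = sample_mixed_batch(real, model, batch_size=8, real_fraction=0.25)
print(len(batch))  # → 8
```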

6. Experimental Results

LOMPO is empirically evaluated on:

  • DeepMind Control Suite (Walker Walk, 64×64 RGB)
  • ROBEL (D’Claw Screw, 128×128 + proprioception)
  • Adroit suite (Adroit Pen, 128×128 + proprioception)
  • Meta-World Sawyer Door Open
  • Real Franka Panda drawer closing (64×64 overhead)

Offline datasets include "medium-replay," "medium-expert," and "expert" variants. Baselines include Behavior Cloning (BC), Conservative Q-Learning (CQL), MBPO-style latent rollouts (LMBRL), Offline SLAC, and Visual Foresight (real robot).

Reported results (normalized return 0–100):

| Task | LOMPO | LMBRL | CQL | BC |
|------|-------|-------|-----|-----|
| Walker | 74.9 | 44.7 | 14.7 | 5.3 |
| D'Claw | 71.8 | 72.4 | 26.3 | 11.7 |
| Adroit Pen | 82.8 | 5.2 | 25.8 | 46.7 |
| Sawyer Door | 95.7 | 0.0 | 0.0 | 0.0 |

Real robot drawer-closing: LOMPO achieves a 76% success rate; all baselines achieve 0%. Across all tasks, LOMPO's latent-space uncertainty and pessimism mechanisms yield substantially higher performance versus model-free, online visual model-based, and naive behavior-cloning methods (Rafailov et al., 2020).

LOMPO represents a conservative, model-based approach for offline RL with visual inputs, specifically addressing the gap between tractable policy optimization and the need for robust generalization from static data. Its methodology of latent-space uncertainty estimation via ensembles effectively mitigates overestimation and exploitation of imperfect dynamics models, a key issue in offline RL.

Recent extensions such as Constrained Latent Action Policies (C-LAP) (Alles et al., 2024) further explore latent joint modeling, but LOMPO introduced the core paradigm of uncertainty-driven pessimism within latent-space model-based policy optimization for pixel-level input tasks. A plausible implication is that as the dimensionality of observation spaces increases, the latent uncertainty quantification and penalization framework will be a central design element for scalable visual offline RL.

No evidence is presented regarding LOMPO’s applicability to continuous online adaptation, nonstationary environments, or transfer learning. Its reliance on offline data quality and distribution coverage remains a potential limitation, as does the scaling of the ensemble for uncertainty estimation in extremely high-dimensional latent spaces.

LOMPO’s approach has established a key methodological component—latent-space ensemble pessimism—in the model-based offline RL landscape, with continued influence in subsequent developments in both state-based and image-based offline learning (Rafailov et al., 2020).
