
Latent Offline Model-Based Policy Optimization

Updated 22 January 2026
  • LOMPO is a framework for offline RL that integrates latent-space dynamics, uncertainty quantification, and conservative policy optimization to handle complex visual inputs.
  • It employs low-dimensional latent representations and ensemble uncertainty estimation to mitigate overestimation and address challenges from static offline datasets.
  • Empirical results demonstrate that LOMPO outperforms baselines on tasks such as Walker, D’Claw, and real robot scenarios by leveraging latent-space pessimism.

Latent Offline Model-Based Policy Optimization (LOMPO) is a framework for offline reinforcement learning (RL) from high-dimensional visual observations, integrating latent-space dynamics modeling, uncertainty quantification, and conservative policy optimization. Developed in the context of partially observable Markov decision processes (POMDPs), LOMPO addresses the challenges inherent to static datasets and complex observation spaces in offline RL environments.

1. Problem Setting and Formalization

LOMPO operates in a POMDP defined by:

  • Observation space $\mathcal{X}$ (e.g., images)
  • Latent state space $\mathcal{S}$
  • Action space $\mathcal{A}$
  • Transition kernel $T(s'|s,a)$
  • Observation kernel $D(o|s)$
  • Reward function $r(s,a)$
  • Initial state distribution $\mu_0(s)$ and discount factor $\gamma$

The agent is provided only with a static dataset:

$$D = \{(o_1, a_1, r_1, o_2),\ \dotsc,\ (o_{T-1}, a_{T-1}, r_{T-1}, o_T)\}$$

collected under some unknown behavior policy, with no further environment interaction allowed. The objective is to learn a policy $\pi(a|o)$ (or, in latent space, $\pi(a|z)$) that maximizes the expected discounted return:

$$\eta(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$$

where $s_t$ is unobserved and $o_t \sim D(\cdot|s_t)$ (Rafailov et al., 2020).
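As a concrete check of the objective above, the discounted return of a finite trajectory can be computed directly. This is a minimal illustration, not code from the paper; `discounted_return` is a hypothetical helper name.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of the discounted return sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))  # [1, gamma, gamma^2, ...]
    return float(np.sum(discounts * rewards))

# Toy trajectory: 1.0 + 0.5*2.0 + 0.25*3.0 = 2.75 with gamma = 0.5
print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))  # → 2.75
```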

2. Latent-Space Dynamics Modeling and Variational Training

LOMPO introduces a low-dimensional latent variable $z_t$ to summarize state. The model is composed of:

  • A generative model $p_\theta$:
    • $p_\theta(z_1) = p(z_1)$ (prior)
    • $p_\theta(z_t|z_{t-1}, a_{t-1})$ for $t \geq 2$ (latent transition)
    • $p_\theta(o_t|z_t)$ (observation decoder)
  • An inference network $q_\phi$:
    • $q_\phi(z_1|o_1)$
    • $q_\phi(z_t|z_{t-1}, a_{t-1}, o_t)$ for $t \geq 2$

The joint likelihood is:

$$p_\theta(o_{1:T}, z_{1:T} \mid a_{1:T-1}) = p(z_1)\prod_{t=2}^T p_\theta(z_t|z_{t-1}, a_{t-1})\,\prod_{t=1}^T p_\theta(o_t|z_t)$$

Training is performed by maximizing the variational evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{z_{1:T} \sim q_\phi}\left[ \sum_{t=1}^T \log p_\theta(o_t|z_t) + \sum_{t=2}^T \log p_\theta(z_t|z_{t-1}, a_{t-1}) - \sum_{t=2}^T \log q_\phi(z_t|z_{t-1}, a_{t-1}, o_t) - \log q_\phi(z_1|o_1) + \log p(z_1) \right]$$

In practice, this is evaluated as

$$\mathcal{L}(\theta, \phi) \simeq \sum_{t=1}^T \mathbb{E}_{q_\phi}[\log p_\theta(o_t|z_t)] - \sum_{t=2}^T \mathbb{E}_{q_\phi}\left[\mathrm{KL}\big(q_\phi(z_t|z_{t-1}, a_{t-1}, o_t)\,\Vert\, p_\theta(z_t|z_{t-1}, a_{t-1})\big)\right] - \mathrm{KL}\big(q_\phi(z_1|o_1)\,\Vert\, p(z_1)\big)$$

(Rafailov et al., 2020).
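When both the posterior and the prior are diagonal Gaussians, as is typical in latent dynamics models, the per-step KL terms in the practical form above have a closed-form expression. A minimal sketch under that assumption (`diag_gauss_kl` is an illustrative helper, not the paper's implementation):

```python
import numpy as np

def diag_gauss_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    Gaussians, summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return float(np.sum(kl))

# KL between identical distributions is zero:
mu, logvar = np.zeros(3), np.zeros(3)
print(diag_gauss_kl(mu, logvar, mu, logvar))  # → 0.0
```

Summing these per-step KLs and subtracting them from the reconstruction term gives the ELBO estimate used for training $(\theta, \phi)$.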

3. Latent-Space Uncertainty Quantification

To mitigate compounding model errors and value overestimation on out-of-distribution (OOD) samples, LOMPO employs an ensemble of $K$ latent transition models $\{T_{\theta_1}, \dotsc, T_{\theta_K}\}$.

Uncertainty $u(z,a)$ can be estimated as either:

  • The variance of ensemble means:

$$u(z,a) = \frac{1}{K}\sum_{k=1}^K \big\|\mu_k(z,a) - \bar\mu(z,a)\big\|^2, \quad \bar\mu = \frac{1}{K} \sum_k \mu_k$$

  • The variance of log-likelihoods:

$$u(z,a) = \operatorname{Var}_{k=1..K}\big[\log T_{\theta_k}(z'|z,a)\big]$$

These estimates provide a bound on one-step model error under certain conditions (Rafailov et al., 2020).
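The variance-of-means estimator above can be sketched in a few lines (illustrative only; `ensemble_disagreement` is a hypothetical helper operating on the ensemble members' predicted means):

```python
import numpy as np

def ensemble_disagreement(mus):
    """Uncertainty u(z,a) as the mean squared deviation of ensemble mean
    predictions from their average (variance-of-means estimator).

    mus: array of shape (K, latent_dim), one predicted mean per member.
    """
    mus = np.asarray(mus, dtype=float)
    mu_bar = mus.mean(axis=0)  # average prediction over the K members
    return float(np.mean(np.sum((mus - mu_bar) ** 2, axis=1)))

# Members that agree exactly yield zero uncertainty;
# spreading their predictions apart increases it.
print(ensemble_disagreement([[0.0, 0.0], [0.0, 0.0]]))   # → 0.0
print(ensemble_disagreement([[1.0, 0.0], [-1.0, 0.0]]))  # → 1.0
```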

4. Policy Optimization via Pessimism in Latent Space

LOMPO constructs a penalized latent MDP:

$$\widetilde{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \hat{T}, \tilde{r}, \mu_0, \gamma)$$

with pessimistic rewards

$$\tilde{r}(z,a) = \mathbb{E}_{z' \sim \hat{T}(\cdot|z,a)}[\hat{r}(z,a)] - \lambda u(z,a)$$

where $\lambda$ controls the degree of conservatism. The actor $\pi_\psi(a|z)$ and critic $Q_\psi(z,a)$ are both parameterized and optimized using an off-policy algorithm in latent space (e.g., Soft Actor-Critic-style updates):

  • Roll out trajectories in latent space under the ensemble model, starting from $z_1 \sim q_\phi(\cdot|o_1)$.
  • Form a model buffer of transitions $(z_t, a_t, \tilde{r}_t, z_{t+1})$.
  • Mix these with "real" transitions $(z_t, a_t, r_t, z_{t+1})$ inferred via $q_\phi$.
  • Update policy and critic parameters via the corresponding loss functions:

$$\mathcal{L}_Q = \mathbb{E}_{(z,a,\tilde{r},z')}\left[\left(Q_\psi(z,a) - \left[\tilde{r} + \gamma\,\mathbb{E}_{a'\sim\pi_\psi} Q_\psi(z',a')\right]\right)^2\right]$$

$$\mathcal{L}_\pi = \mathbb{E}_{z}\left[\mathbb{E}_{a\sim\pi_\psi}\left[\alpha \log\pi_\psi(a|z) - Q_\psi(z,a)\right]\right]$$

The $\lambda u(z,a)$ penalty enforces conservatism, discouraging the policy from selecting actions in regions where the latent model is uncertain (Rafailov et al., 2020).
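Putting the reward penalty together with the disagreement estimator, a batch of penalized rewards might be computed as follows. This is a sketch under the assumption of Gaussian ensemble heads; all names are illustrative, not the paper's code.

```python
import numpy as np

def pessimistic_rewards(r_hat, mus, lam=1.0):
    """Penalized reward r~(z,a) = r_hat(z,a) - lam * u(z,a), with u the
    ensemble variance-of-means disagreement.

    r_hat: (B,) predicted rewards for a batch of latent transitions.
    mus:   (K, B, D) predicted next-latent means, one per ensemble member.
    lam:   conservatism coefficient (lambda in the text).
    """
    mus = np.asarray(mus, dtype=float)
    mu_bar = mus.mean(axis=0, keepdims=True)                    # (1, B, D)
    u = np.mean(np.sum((mus - mu_bar) ** 2, axis=-1), axis=0)   # (B,)
    return np.asarray(r_hat, dtype=float) - lam * u

# A transition where the two members disagree gets its reward pushed down:
r = pessimistic_rewards([1.0], [[[1.0, 0.0]], [[-1.0, 0.0]]], lam=0.5)
print(r)  # → [0.5]
```

Increasing `lam` shrinks the effective reward in uncertain regions, which is exactly the mechanism that steers the policy back toward the data distribution.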

5. Algorithmic Summary

A high-level workflow for LOMPO:

| Step | Description | Output |
|------|-------------|--------|
| 1 | Process dataset through $q_\phi$ to obtain $z_t$ | "Real" replay buffer $\mathbb{B}_\text{real}$ |
| 2 | Train $(\theta, \phi)$ by maximizing the ELBO | Trained latent model and encoder |
| 3 | Initialize policy, critic, and model buffer $\mathbb{B}_\text{model}$ | Ready for policy optimization |
| 4 | Generate imagined rollouts in latent space using the ensemble; compute $\tilde{r}_t$; store in $\mathbb{B}_\text{model}$ | Model-generated transitions |
| 5 | Alternate updates: sample mixed minibatches from $\mathbb{B}_\text{real} \cup \mathbb{B}_\text{model}$ and take gradient steps on the policy and critic losses | Updated actor-critic |

All computation is performed without further environment interaction, making the procedure compatible with the fully offline setting.
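Step 5's mixed sampling can be sketched as follows (illustrative only; `real_fraction` is a hypothetical mixing hyperparameter, and the paper may use a different ratio or sampling scheme):

```python
import random

def sample_mixed_batch(real_buffer, model_buffer, batch_size, real_fraction=0.5):
    """Sample a minibatch mixing real (encoded) and model-generated
    transitions, as in step 5 of the workflow above."""
    n_real = int(round(batch_size * real_fraction))
    n_model = batch_size - n_real
    batch = (random.choices(real_buffer, k=n_real)
             + random.choices(model_buffer, k=n_model))
    random.shuffle(batch)  # avoid ordering real before model transitions
    return batch

real = [("real", i) for i in range(10)]
model = [("model", i) for i in range(10)]
batch = sample_mixed_batch(real, model, batch_size=8, real_fraction=0.25)
print(len(batch))  # → 8
```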

6. Experimental Results

LOMPO is empirically evaluated on:

  • DeepMind Control Suite (Walker Walk, 64×64 RGB)
  • ROBEL (D’Claw Screw, 128×128 + proprioception)
  • Adroit suite (Adroit Pen, 128×128 + proprioception)
  • Meta-World Sawyer Door Open
  • Real Franka Panda drawer closing (64×64 overhead)

Offline datasets include "medium-replay," "medium-expert," and "expert" variants. Baselines include Behavior Cloning (BC), Conservative Q-Learning (CQL), MBPO-style latent rollouts (LMBRL), Offline SLAC, and Visual Foresight (real robot).

Reported results (normalized return 0–100):

| Task | LOMPO | LMBRL | CQL | BC |
|------|-------|-------|-----|-----|
| Walker | 74.9 | 44.7 | 14.7 | 5.3 |
| D'Claw | 71.8 | 72.4 | 26.3 | 11.7 |
| Adroit Pen | 82.8 | 5.2 | 25.8 | 46.7 |
| Sawyer Door | 95.7 | 0.0 | 0.0 | 0.0 |

Real robot drawer-closing: LOMPO achieves a 76% success rate; all baselines achieve 0%. Across all tasks, LOMPO's latent-space uncertainty and pessimism mechanisms yield substantially higher performance versus model-free, online visual model-based, and naive behavior-cloning methods (Rafailov et al., 2020).

LOMPO represents a conservative, model-based approach for offline RL with visual inputs, specifically addressing the gap between tractable policy optimization and the need for robust generalization from static data. Its methodology of latent-space uncertainty estimation via ensembles effectively mitigates overestimation and exploitation of imperfect dynamics models, a key issue in offline RL.

Recent extensions such as Constrained Latent Action Policies (C-LAP) (Alles et al., 2024) further explore latent joint modeling, but LOMPO introduced the core paradigm of uncertainty-driven pessimism within latent-space model-based policy optimization for pixel-level input tasks. A plausible implication is that as the dimensionality of observation spaces increases, the latent uncertainty quantification and penalization framework will be a central design element for scalable visual offline RL.

No evidence is presented regarding LOMPO’s applicability to continuous online adaptation, nonstationary environments, or transfer learning. Its reliance on offline data quality and distribution coverage remains a potential limitation, as does the scaling of the ensemble for uncertainty estimation in extremely high-dimensional latent spaces.

LOMPO’s approach has established a key methodological component—latent-space ensemble pessimism—in the model-based offline RL landscape, with continued influence in subsequent developments in both state-based and image-based offline learning (Rafailov et al., 2020).
