Latent Offline Model-Based Policy Optimization
- LOMPO is a framework for offline RL that integrates latent-space dynamics, uncertainty quantification, and conservative policy optimization to handle complex visual inputs.
- It employs low-dimensional latent representations and ensemble uncertainty estimation to mitigate overestimation and address challenges from static offline datasets.
- Empirical results demonstrate that LOMPO outperforms or matches baselines on tasks such as Walker, D’Claw, Adroit Pen, and real-robot scenarios by leveraging latent-space pessimism.
Latent Offline Model-Based Policy Optimization (LOMPO) is a framework for offline reinforcement learning (RL) from high-dimensional visual observations, integrating latent-space dynamics modeling, uncertainty quantification, and conservative policy optimization. Developed in the context of partially observable Markov decision processes (POMDPs), LOMPO addresses the challenges inherent to static datasets and complex observation spaces in offline RL environments.
1. Problem Setting and Formalization
LOMPO operates in a POMDP $\mathcal{M} = (\mathcal{O}, \mathcal{S}, \mathcal{A}, T, U, R, \rho_0, \gamma)$ defined by:
- Observation space $\mathcal{O}$ (e.g., images)
- Latent state space $\mathcal{S}$
- Action space $\mathcal{A}$
- Transition kernel $T(s_{t+1} \mid s_t, a_t)$
- Observation kernel $U(o_t \mid s_t)$
- Reward function $R(s_t, a_t)$
- Initial state distribution $\rho_0(s_1)$, discount factor $\gamma \in (0, 1)$.
The agent is provided only with a static dataset of transitions:

$$\mathcal{D} = \{(o_t, a_t, r_t, o_{t+1})\},$$

collected under some unknown behavior policy, with no further environment interaction allowed. The objective is to learn a policy $\pi(a_t \mid o_{\leq t}, a_{<t})$ (or, in latent space, $\pi(a_t \mid s_t)$) that maximizes the expected discounted return:

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a_t)\right],$$

where the state $s_t$ is unobserved and must be inferred from the history of observations and actions (Rafailov et al., 2020).
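As a minimal illustration of the objective (not part of the LOMPO implementation), the discounted return of a single reward sequence can be computed directly:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted return of one trajectory: sum_t gamma^(t-1) * r_t."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))  # gamma^0, gamma^1, ...
    return float(np.sum(discounts * rewards))

# With gamma = 0.5: 1 + 0.5*2 + 0.25*4 = 3.0
print(discounted_return([1.0, 2.0, 4.0], gamma=0.5))  # -> 3.0
```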
2. Latent-Space Dynamics Modeling and Variational Training
LOMPO introduces a low-dimensional latent variable $s_t$ to summarize the observation history. The model is composed of:
- A generative model $p_\theta$:
  - $p_\theta(s_1)$ (prior)
  - $p_\theta(s_{t+1} \mid s_t, a_t)$ for $t \geq 1$ (latent transition)
  - $p_\theta(o_t \mid s_t)$ (observation decoder)
- An inference network $q_\phi$:
  - $q_\phi(s_t \mid o_{\leq t}, a_{<t})$ for $t \geq 1$

The joint likelihood factorizes as:

$$p_\theta(o_{1:T}, s_{1:T} \mid a_{1:T-1}) = p_\theta(s_1)\, p_\theta(o_1 \mid s_1) \prod_{t=2}^{T} p_\theta(s_t \mid s_{t-1}, a_{t-1})\, p_\theta(o_t \mid s_t).$$

Training is performed by maximizing the variational evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi}\left[\sum_{t=1}^{T} \log p_\theta(o_t \mid s_t) - D_{\mathrm{KL}}\big(q_\phi(s_t \mid o_{\leq t}, a_{<t}) \,\|\, p_\theta(s_t \mid s_{t-1}, a_{t-1})\big)\right].$$

In practice, the expectation is approximated with reparameterized samples from $q_\phi$ and optimized by stochastic gradient ascent.
3. Latent-Space Uncertainty Quantification
To mitigate compounding model errors and value overestimation on out-of-distribution (OOD) samples, LOMPO employs an ensemble of $K$ latent transition models $\{\hat{p}_{\theta_i}(s_{t+1} \mid s_t, a_t)\}_{i=1}^{K}$.
Uncertainty for a transition $(s_t, a_t, s_{t+1})$ can be estimated as either:
- The variance of ensemble means: $u(s_t, a_t) = \operatorname{Var}_{i}\left[\mu_{\theta_i}(s_t, a_t)\right]$
- The variance of log-likelihoods: $u(s_t, a_t) = \operatorname{Var}_{i}\left[\log \hat{p}_{\theta_i}(s_{t+1} \mid s_t, a_t)\right]$
These estimates provide a bound on one-step model error under certain conditions (Rafailov et al., 2020).
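The first estimator (variance of ensemble means) can be sketched as follows; the array layout is an assumption for illustration:

```python
import numpy as np

def ensemble_disagreement(means):
    """Uncertainty as the variance of ensemble predicted means.

    means: shape (K, latent_dim) -- each ensemble member's predicted mean of
    the next latent state for one (s, a) pair. Returns a scalar penalty:
    per-dimension variance across members, summed over dimensions.
    """
    means = np.asarray(means, dtype=np.float64)
    return float(np.sum(np.var(means, axis=0)))

# Two members that agree exactly -> zero uncertainty.
print(ensemble_disagreement([[0.5, -1.0], [0.5, -1.0]]))  # -> 0.0
```

Disagreement grows as members diverge on where the next latent state lies, which is exactly the OOD signal the penalty exploits.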
4. Policy Optimization via Pessimism in Latent Space
LOMPO constructs a penalized latent MDP:

$$\widetilde{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \hat{T}, \tilde{r}, \rho_0, \gamma)$$

with pessimistic rewards:

$$\tilde{r}(s_t, a_t) = \hat{r}(s_t, a_t) - \lambda\, u(s_t, a_t),$$

where $\lambda > 0$ controls the degree of conservatism. The actor and critic are both parameterized and optimized using an off-policy algorithm in latent space (e.g., Soft Actor-Critic-style updates):
- Roll out trajectories in latent space under the ensemble model, starting from latent states encoded from the offline dataset.
- Form a model buffer $\mathcal{D}_{\text{model}}$ with transitions $(s_t, a_t, \tilde{r}_t, s_{t+1})$.
- Mix these with "real" transitions inferred from $\mathcal{D}$ via the encoder $q_\phi$.
- Policy and critic parameters are updated via the corresponding actor-critic loss functions.
The penalty enforces conservatism, discouraging the policy from selecting actions in regions where the latent model is uncertain (Rafailov et al., 2020).
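The reward penalty itself is a one-line transformation; a minimal sketch (with `lam` standing in for the conservatism coefficient $\lambda$):

```python
import numpy as np

def pessimistic_rewards(pred_rewards, uncertainties, lam=1.0):
    """r_tilde = r_hat - lam * u: penalize the model's predicted rewards by
    their ensemble uncertainty estimates (elementwise over a batch)."""
    return np.asarray(pred_rewards, dtype=np.float64) - lam * np.asarray(
        uncertainties, dtype=np.float64
    )

# Two transitions with equal predicted reward; the uncertain one is penalized.
r_tilde = pessimistic_rewards([1.0, 1.0], [0.0, 0.5], lam=2.0)
print(r_tilde)  # -> [1. 0.]
```

Larger `lam` pushes the policy harder toward regions where the ensemble agrees, trading off potential return for robustness.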
5. Algorithmic Summary
A high-level workflow for LOMPO:
| Step | Description | Output |
|---|---|---|
| 1 | Process dataset $\mathcal{D}$ through the encoder $q_\phi$ to obtain latent transitions | "Real" replay buffer $\mathcal{D}_{\text{real}}$ |
| 2 | Train $p_\theta$, $q_\phi$ by maximizing the ELBO | Trained latent model and encoder |
| 3 | Initialize policy, critic, and model buffer $\mathcal{D}_{\text{model}}$ | Ready for policy optimization |
| 4 | Generate imagined rollouts in latent space using the ensemble; compute $\tilde{r}$; store in $\mathcal{D}_{\text{model}}$ | Model-generated transitions |
| 5 | Alternate updates: sample mixed minibatches from $\mathcal{D}_{\text{real}} \cup \mathcal{D}_{\text{model}}$ and perform gradient steps on policy and critic losses | Updated actor-critic |
All computation is performed without further environment interaction, compatible with the fully offline setting.
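Step 5's mixed sampling can be sketched as below; the buffer representation and the default mixing ratio are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sample_mixed_minibatch(real_buffer, model_buffer, batch_size,
                           real_fraction=0.5, rng=None):
    """Sample a minibatch mixing 'real' (encoded-from-data) and
    model-generated latent transitions; real_fraction is a hyperparameter
    controlling the share drawn from the real buffer."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n_real = int(round(batch_size * real_fraction))
    n_model = batch_size - n_real
    real_idx = rng.integers(0, len(real_buffer), size=n_real)
    model_idx = rng.integers(0, len(model_buffer), size=n_model)
    return ([real_buffer[i] for i in real_idx]
            + [model_buffer[i] for i in model_idx])
```

Mixing real transitions into the minibatch anchors the critic to in-distribution data while the model-generated transitions broaden coverage.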
6. Experimental Results
LOMPO is empirically evaluated on:
- DeepMind Control Suite (Walker Walk, 64×64 RGB)
- ROBEL (D’Claw Screw, 128×128 + proprioception)
- Adroit suite (Adroit Pen, 128×128 + proprioception)
- Meta-World Sawyer Door Open
- Real Franka Panda drawer closing (64×64 overhead)
Offline datasets include “medium-replay,” “medium-expert,” and “expert” variants. Baselines include Behavior Cloning (BC), Conservative Q-Learning (CQL), MBPO-style latent rollouts (LMBRL), Offline SLAC, and Visual Foresight (real robot).
Reported results (normalized return 0–100):
| Task | LOMPO | LMBRL | CQL | BC |
|---|---|---|---|---|
| Walker | 74.9 | 44.7 | 14.7 | 5.3 |
| D'Claw | 71.8 | 72.4 | 26.3 | 11.7 |
| Adroit Pen | 82.8 | 5.2 | 25.8 | 46.7 |
| Sawyer Door | 95.7 | 0.0 | 0.0 | 0.0 |
Real robot drawer-closing: LOMPO achieves a 76% success rate; all baselines achieve 0%. Across all tasks, LOMPO's latent-space uncertainty and pessimism mechanisms yield substantially higher performance versus model-free, online visual model-based, and naive behavior-cloning methods (Rafailov et al., 2020).
7. Context, Limitations, and Related Developments
LOMPO represents a conservative, model-based approach for offline RL with visual inputs, specifically addressing the gap between tractable policy optimization and the need for robust generalization from static data. Its methodology of latent-space uncertainty estimation via ensembles effectively mitigates overestimation and exploitation of imperfect dynamics models, a key issue in offline RL.
Recent extensions such as Constrained Latent Action Policies (C-LAP) (Alles et al., 2024) further explore latent joint modeling, but LOMPO introduced the core paradigm of uncertainty-driven pessimism within latent-space model-based policy optimization for pixel-level input tasks. A plausible implication is that as the dimensionality of observation spaces increases, the latent uncertainty quantification and penalization framework will be a central design element for scalable visual offline RL.
No evidence is presented regarding LOMPO’s applicability to continuous online adaptation, nonstationary environments, or transfer learning. Its reliance on offline data quality and distribution coverage remains a potential limitation, as does the scaling of the ensemble for uncertainty estimation in extremely high-dimensional latent spaces.
LOMPO’s approach has established a key methodological component—latent-space ensemble pessimism—in the model-based offline RL landscape, with continued influence in subsequent developments in both state-based and image-based offline learning (Rafailov et al., 2020).