
Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective (2209.08466v3)

Published 18 Sep 2022 in cs.LG, cs.AI, and cs.RO

Abstract: While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representation of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL on policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods. While sample efficient methods typically are computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.

References (87)
  1. Policy-aware model learning for policy gradient methods, 2020. URL https://arxiv.org/abs/2003.00030.
  2. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, 2021. ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2021.05.008. URL https://www.sciencedirect.com/science/article/pii/S1566253521001081.
  3. Maximum a posteriori policy optimisation, 2018. URL https://arxiv.org/abs/1806.06920.
  4. Differentiable mpc for end-to-end planning and control, 2018. URL https://arxiv.org/abs/1810.13400.
  5. On the model-based stochastic value gradient for continuous reinforcement learning, 2020. URL https://arxiv.org/abs/2008.12775.
  6. Deciding what to model: Value-equivalent sampling for reinforcement learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=fORXbIlTELP.
  7. Hagai Attias. Planning by probabilistic inference. In Christopher M. Bishop and Brendan J. Frey (eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, volume R4 of Proceedings of Machine Learning Research, pp.  9–16. PMLR, 03–06 Jan 2003. URL https://proceedings.mlr.press/r4/attias03a.html. Reissued by PMLR on 01 April 2021.
  8. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450.
  9. Information prioritization through empowerment in visual model-based rl. In International Conference on Learning Representations, 2021.
  10. Planning as inference. Trends in Cognitive Sciences, 16(10):485–488, 2012. ISSN 1364-6613. doi: https://doi.org/10.1016/j.tics.2012.08.006. URL https://www.sciencedirect.com/science/article/pii/S1364661312001957.
  11. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/f02208a057804ee16ac72ff4d3cec53b-Paper.pdf.
  12. Learning and querying fast generative models for reinforcement learning, 2018. URL https://arxiv.org/abs/1802.03006.
  13. Randomized ensembled double q-learning: Learning fast without a model, 2021. URL https://arxiv.org/abs/2101.05982.
  14. Deep reinforcement learning in a handful of trials using probabilistic dynamics models, 2018. URL https://arxiv.org/abs/1805.12114.
  15. Model-augmented actor-critic: Backpropagating through paths, 2020. URL https://arxiv.org/abs/2005.08068.
  16. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML), 2011.
  17. Dreamerpro: Reconstruction-free model-based reinforcement learning with prototypical representations, 2021. URL https://arxiv.org/abs/2110.14565.
  18. Gradient-aware model-based policy search. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3801–3808, apr 2020. doi: 10.1609/aaai.v34i04.5791. URL https://doi.org/10.1609%2Faaai.v34i04.5791.
  19. Provable rl with exogenous distractors via multistep inverse dynamics. arXiv preprint arXiv:2110.08847, 2021.
  20. Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers. arXiv e-prints, art. arXiv:2006.13916, June 2020.
  21. Mismatched no more: Joint model-policy optimization for model-based rl, 2021a. URL https://arxiv.org/abs/2110.02758.
  22. Robust predictable control, 2021b. URL https://arxiv.org/abs/2109.03214.
  23. Amir-massoud Farahmand. Iterative value-aware model learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/7a2347d96752880e3d58d72e9813cc14-Paper.pdf.
  24. Value-Aware Loss Function for Model-based Reinforcement Learning. In Aarti Singh and Jerry Zhu (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1486–1494. PMLR, 20–22 Apr 2017. URL https://proceedings.mlr.press/v54/farahmand17a.html.
  25. Model-based value estimation for efficient model-free reinforcement learning, 2018. URL https://arxiv.org/abs/1803.00101.
  26. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  1587–1596. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/fujimoto18a.html.
  27. Model predictive control: Theory and practice - a survey. Autom., 25(3):335–348, 1989. URL http://dblp.uni-trier.de/db/journals/automatica/automatica25.html#GarciaPM89.
  28. Loss surfaces, mode connectivity, and fast ensembling of dnns, 2018. URL https://arxiv.org/abs/1802.10026.
  29. Reinforcement learning with competitive ensembles of information-constrained primitives, 2019. URL https://arxiv.org/abs/1906.10667.
  30. Bootstrap your own latent: A new approach to self-supervised learning, 2020. URL https://arxiv.org/abs/2006.07733.
  31. The value equivalence principle for model-based reinforcement learning, 2020. URL https://arxiv.org/abs/2011.03506.
  32. Learning invariant feature spaces to transfer skills with reinforcement learning, 2017. URL https://arxiv.org/abs/1703.02949.
  33. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf.
  34. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018. URL https://arxiv.org/abs/1801.01290.
  35. Learning latent dynamics for planning from pixels, 2018. URL https://arxiv.org/abs/1811.04551.
  36. Dream to control: Learning behaviors by latent imagination, 2019. URL https://arxiv.org/abs/1912.01603.
  37. Mastering atari with discrete world models, 2020. URL https://arxiv.org/abs/2010.02193.
  38. On the role of planning in model-based deep reinforcement learning, 2020. URL https://arxiv.org/abs/2011.04021.
  39. Temporal difference learning for model predictive control. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  8387–8406. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/hansen22a.html.
  40. Learning continuous control policies by stochastic value gradients, 2015. URL https://arxiv.org/abs/1510.09142.
  41. Hallucinating value: A pitfall of dyna-style planning with imperfect environment models, 2020. URL https://arxiv.org/abs/2006.04363.
  42. When to trust your model: Model-based policy optimization, 2019. URL https://arxiv.org/abs/1906.08253.
  43. Gamma-models: Generative temporal difference learning for infinite-horizon prediction. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1724–1735. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/12ffb0968f2f56e51a59a6beb37b2859-Paper.pdf.
  44. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  45. Reinforcement learning with misspecified model classes. In 2013 IEEE International Conference on Robotics and Automation, pp.  939–946, 2013. doi: 10.1109/ICRA.2013.6630686.
  46. Model-based reinforcement learning for atari, 2019. URL https://arxiv.org/abs/1903.00374.
  47. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
  48. Model-ensemble trust-region policy optimization, 2018. URL https://arxiv.org/abs/1802.10592.
  49. Objective mismatch in model-based reinforcement learning. ArXiv, abs/2002.04523, 2020.
  50. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020.
  51. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020. URL https://arxiv.org/abs/2005.01643.
  52. Continuous control with deep reinforcement learning, 2015. URL https://arxiv.org/abs/1509.02971.
  53. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees, 2018. URL https://arxiv.org/abs/1807.03858.
  54. Gradients are not all you need, 2021. URL https://arxiv.org/abs/2111.05803.
  55. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pp. 6961–6971. PMLR, 2020.
  56. Model predictive actor-critic: Accelerating robot skill acquisition with deep reinforcement learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, may 2021. doi: 10.1109/icra48506.2021.9561298. URL https://doi.org/10.1109%2Ficra48506.2021.9561298.
  57. Temporal predictive coding for model-based planning in latent space. In International Conference on Machine Learning, pp. 8130–8139. PMLR, 2021.
  58. Control-oriented model-based reinforcement learning with implicit differentiation, 2021. URL https://arxiv.org/abs/2106.03273.
  59. Action-conditional video prediction using deep networks in atari games. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pp.  2863–2871, Cambridge, MA, USA, 2015. MIT Press.
  60. Value prediction network, 2017. URL https://arxiv.org/abs/1707.03497.
  61. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction, 2020. URL https://arxiv.org/abs/2007.14535.
  62. Path integral networks: End-to-end differentiable optimal control, 2017. URL https://arxiv.org/abs/1706.09597.
  63. Pipps: Flexible model-based policy search robust to the curse of chaos, 2019. URL https://arxiv.org/abs/1902.01240.
  64. Relative entropy policy search. In AAAI, 2010.
  65. A survey on offline reinforcement learning: Taxonomy, review, and open problems, 2022. URL https://arxiv.org/abs/2203.01387.
  66. Imagination-augmented agents for deep reinforcement learning. ArXiv, abs/1707.06203, 2017.
  67. A game theoretic framework for model based reinforcement learning, 2020. URL https://arxiv.org/abs/2004.07804.
  68. Which mutual-information representation learning objectives are sufficient for control? Advances in Neural Information Processing Systems, 34:26345–26357, 2021.
  69. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, dec 2020. doi: 10.1038/s41586-020-03051-4. URL https://doi.org/10.1038%2Fs41586-020-03051-4.
  70. High-dimensional continuous control using generalized advantage estimation, 2015. URL https://arxiv.org/abs/1506.02438.
  71. Model-based policy optimization with unsupervised model adaptation, 2020. URL https://arxiv.org/abs/2010.09546.
  72. Learning off-policy with online planning, 2020. URL https://arxiv.org/abs/2008.10066.
  73. Local search for policy iteration in continuous control, 2020. URL https://arxiv.org/abs/2010.05545.
  74. Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2(4):160–163, 1991.
  75. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html.
  76. Value iteration networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/c21002f464c5fc5bee3b98ced83963b8-Paper.pdf.
  77. Russ Tedrake. Underactuated Robotics. Course Notes for MIT 6.832, 2022. URL http://underactuated.mit.edu.
  78. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, volume 6, pp.  1–9, 1993.
  79. Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp.  1049–1056, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553508. URL https://doi.org/10.1145/1553374.1553508.
  80. Value gradient weighted model-based reinforcement learning, 2022. URL https://arxiv.org/abs/2204.01464.
  81. Exploring model-based planning with policy networks, 2019. URL https://arxiv.org/abs/1906.08649.
  82. Benchmarking model-based reinforcement learning, 2019. URL https://arxiv.org/abs/1907.02057.
  83. How good is the bayes posterior in deep neural networks really?, 2020. URL https://arxiv.org/abs/2002.02405.
  84. Latent skill planning for exploration and transfer. In International Conference on Learning Representations, 2020.
  85. Mastering visual continuous control: Improved data-augmented reinforcement learning, 2021. URL https://arxiv.org/abs/2107.09645.
  86. Reward is enough for convex mdps, 2021. URL https://arxiv.org/abs/2106.00661.
  87. Learning invariant representations for reinforcement learning without reconstruction, 2020. URL https://arxiv.org/abs/2006.10742.
Authors (5)
  1. Raj Ghugare (4 papers)
  2. Homanga Bharadhwaj (36 papers)
  3. Benjamin Eysenbach (59 papers)
  4. Sergey Levine (531 papers)
  5. Ruslan Salakhutdinov (248 papers)
Citations (23)

Summary

The Unified Objective: Formulation and Derivation

Model-based reinforcement learning (MBRL) aims to improve sample efficiency over model-free methods by learning a dynamics model. However, in settings with high-dimensional observations (e.g., images), learning accurate world models directly on raw inputs is challenging, and prediction errors can compound. A common approach is to learn a low-dimensional representation $z_t$ of the observation $s_t$ using an encoder $e_\phi(z_t | s_t)$, learn a latent dynamics model $m_\phi(z_{t+1} | z_t, a_t)$, and train a policy $\pi_\phi(a_t | z_t)$ in this latent space. This decomposition often involves separate objectives: the encoder might use reconstruction or contrastive losses, the model typically uses maximum likelihood estimation (MLE), and the policy maximizes expected rewards. This separation can lead to an "objective mismatch," where components are optimized for goals not perfectly aligned with maximizing task returns. For instance, a representation that is optimal for reconstruction may not be optimal for control, and a model that is accurate under the MLE objective might perform poorly in states relevant to the policy.

The paper proposes a single objective function derived from a variational lower bound on the expected return, which jointly optimizes the encoder, latent model, and policy. The standard RL objective is to maximize the expected discounted return $E_{p(\tau)}[R(\tau)]$, where $p(\tau) = p_0(s_0) \prod_{t=0}^{H-1} p(s_{t+1} | s_t, a_t)\, \pi(a_t | s_t)$ is the trajectory distribution under the true environment dynamics $p(s_{t+1} | s_t, a_t)$ and the policy $\pi(a_t | s_t)$, and $R(\tau) = \sum_{t=0}^{H} \gamma^t r(s_t, a_t)$ is the discounted return. The key idea is to treat this as maximizing $E_{p(\tau)}[R(\tau)] = \int p(\tau) R(\tau)\, d\tau$ and apply Jensen's inequality to obtain a lower bound on the log expected return $\log E_{p(\tau)}[R(\tau)]$.

Introducing a proposal distribution $q(\tau)$ and using the standard Evidence Lower Bound (ELBO) derivation gives:

$$\log E_{p(\tau)}[R(\tau)] \ge E_{q(\tau)}\big[\log R(\tau) + \log p(\tau) - \log q(\tau)\big]$$
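
For readers who want the intermediate step, the bound follows from rewriting the expectation under $q$ and applying Jensen's inequality to the concave logarithm; this is the standard importance-weighted argument (it assumes $R(\tau) > 0$ so that the logarithm is defined):

$$\log E_{p(\tau)}[R(\tau)] = \log E_{q(\tau)}\!\left[\frac{p(\tau)\, R(\tau)}{q(\tau)}\right] \ge E_{q(\tau)}\!\left[\log \frac{p(\tau)\, R(\tau)}{q(\tau)}\right] = E_{q(\tau)}\big[\log R(\tau) + \log p(\tau) - \log q(\tau)\big]$$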

The crucial step is the design of the proposal distribution $q(\tau)$. Instead of modeling the true, high-dimensional state transitions $p(s_{t+1} | s_t, a_t)$, the paper defines a K-step latent-space proposal distribution $q_\phi^K(\tau)$ based on the learned encoder $e_\phi$, policy $\pi_\phi$, and latent dynamics model $m_\phi$:

$$q_{\phi}^K(\tau) = p_0(s_0)\, e_{\phi}(z_0 | s_0)\, \pi_{\phi}(a_0 | z_0) \prod_{t=1}^{K} p(s_t | s_{t-1}, a_{t-1})\, m_{\phi}(z_{t} | z_{t-1}, a_{t-1})\, \pi_{\phi}(a_t | z_t)$$

Note that this proposal samples the true next states $s_t$ from the environment up to step $K$, but generates the latent states $z_t$ for $t > 0$ using the learned latent model $m_\phi$.
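
To make the structure of $q_\phi^K$ concrete, here is a minimal sketch of sampling one proposal trajectory. It assumes hypothetical callables `encoder`, `latent_model`, and `policy` that return torch distributions, plus a Gymnasium-style `env`; none of these names come from the paper's released code.

```python
import torch

def sample_proposal_trajectory(env, encoder, latent_model, policy, K):
    """Sample one K-step trajectory from the proposal q_phi^K.

    Real states s_t come from the environment; latent states z_t for t > 0
    come from the learned latent model, mirroring the factorization of q_phi^K.
    """
    s, _ = env.reset()                                   # s_0 ~ p_0(s_0)
    s = torch.as_tensor(s, dtype=torch.float32)
    z = encoder(s).sample()                              # z_0 ~ e_phi(. | s_0)
    traj = []
    for t in range(K):
        a = policy(z).sample()                           # a_t ~ pi_phi(. | z_t)
        s_next, r, terminated, truncated, _ = env.step(a.numpy())
        s_next = torch.as_tensor(s_next, dtype=torch.float32)
        z_next = latent_model(z, a).sample()             # z_{t+1} ~ m_phi(. | z_t, a_t)
        traj.append((s, z, a, r, s_next, z_next))
        s, z = s_next, z_next
        if terminated or truncated:
            break
    return traj
```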

Substituting $q_\phi^K$ into the ELBO and simplifying yields the paper's central objective function (Theorem 3.1), a lower bound on the expected return:

$$L^K_\phi = E_{q_{\phi}^K(\tau)} \left[ \left( \sum_{t=0}^{K-1} \gamma^{t}\, \tilde{r}(s_t, a_t, s_{t+1}) \right) + \gamma^{K} \log Q(s_K, a_K) \right]$$

Here, $Q(s_K, a_K)$ represents the value beyond the K-step horizon, and $\tilde{r}$ is an augmented reward:

$$\tilde{r}(s_t, a_t, s_{t+1}) = \underbrace{(1-\gamma) \log r(s_t, a_t)}_{\text{(a) extrinsic reward term}} + \underbrace{\log e_{\phi}(z_{t+1} | s_{t+1}) - \log m_{\phi}(z_{t+1} | z_{t}, a_{t})}_{\text{(b) intrinsic consistency term}}$$
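
As an illustration, the augmented reward for a single transition could be computed as below, assuming `encoder` and `latent_model` return torch distributions and the environment reward `r` is a positive tensor so the logarithm is defined; the names are illustrative, not the paper's implementation.

```python
import torch

def augmented_reward(encoder, latent_model, r, s_next, z, a, z_next, gamma=0.99):
    """tilde_r(s_t, a_t, s_{t+1}) = (1 - gamma) * log r(s_t, a_t)
                                    + log e_phi(z_{t+1} | s_{t+1})
                                    - log m_phi(z_{t+1} | z_t, a_t)"""
    extrinsic = (1.0 - gamma) * torch.log(r)                        # (a) extrinsic term
    consistency = (encoder(s_next).log_prob(z_next).sum(-1)         # (b) intrinsic term
                   - latent_model(z, a).log_prob(z_next).sum(-1))
    return extrinsic + consistency
```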

The objective $L^K_\phi$ depends on the parameters $\phi$ shared by the encoder $e_\phi$, the latent model $m_\phi$, and the policy $\pi_\phi$. Maximizing $L^K_\phi$ simultaneously optimizes all three components:

  1. The extrinsic term (a) encourages maximizing the standard environment rewards (scaled and log-transformed).
  2. The intrinsic term (b) encourages consistency between the latent model's prediction $m_\phi(z_{t+1} | z_t, a_t)$ and the representation of the actual next state encoded by $e_\phi(z_{t+1} | s_{t+1})$. This term can be interpreted as minimizing the KL divergence $D_{KL}\big(e_\phi(\cdot | s_{t+1})\,\|\,m_\phi(\cdot | z_t, a_t)\big)$ under the distribution generated by $q_\phi^K$. It incentivizes the model to predict representations accurately and the encoder to produce representations that are predictable by the model (a closed-form Gaussian sketch of this interpretation follows the list).
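
When the encoder and latent model output diagonal Gaussians, the KL in item 2 has a closed form. The sketch below follows the interpretation stated above; the distribution-returning `encoder` and `latent_model` are the same illustrative assumptions as in the earlier snippets.

```python
import torch
from torch.distributions import Independent, kl_divergence

def consistency_term(encoder, latent_model, s_next, z, a):
    """Negative D_KL(e_phi(. | s_{t+1}) || m_phi(. | z_t, a_t)) for diagonal Gaussians;
    maximizing the intrinsic term corresponds to shrinking this divergence."""
    e_dist = Independent(encoder(s_next), 1)       # e_phi(. | s_{t+1})
    m_dist = Independent(latent_model(z, a), 1)    # m_phi(. | z_t, a_t)
    return -kl_divergence(e_dist, m_dist)          # larger value = more self-consistent
```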

This formulation directly links representation learning, model learning, and policy optimization through a single objective derived from the fundamental RL goal, thus mitigating the objective mismatch problem.

The Aligned Latent Models (ALM) Algorithm

The practical algorithm, Aligned Latent Models (ALM), implements the optimization of the $L^K_\phi$ objective within an actor-critic framework, specifically building upon DDPG. It maintains an encoder $e_\phi$, a latent dynamics model $m_\phi$, a policy $\pi_\phi$, a Q-function $Q_\theta(z_t, a_t)$, and a reward predictor $r_\theta(z_t, a_t)$. Target networks are used for stability.
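
A minimal sketch of the component networks this description implies, written with PyTorch. The widths, Gaussian heads, and latent size are illustrative assumptions (the paper's actual architecture and hyperparameters may differ), and the stochastic policy head is a simplification of the DDPG-style actor.

```python
import copy
import torch
import torch.nn as nn
from torch.distributions import Normal

def mlp(inp, out, hidden=256):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ELU(),
                         nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, out))

class GaussianHead(nn.Module):
    """Maps (concatenated) inputs to a diagonal Gaussian; used here for e_phi, m_phi, pi_phi."""
    def __init__(self, inp, out):
        super().__init__()
        self.net = mlp(inp, 2 * out)

    def forward(self, *xs):
        mean, log_std = self.net(torch.cat(xs, dim=-1)).chunk(2, dim=-1)
        return Normal(mean, log_std.clamp(-5, 2).exp())

obs_dim, act_dim, latent_dim = 17, 6, 50                             # illustrative sizes
nets = {
    "encoder":      GaussianHead(obs_dim, latent_dim),               # e_phi(z_t | s_t)
    "latent_model": GaussianHead(latent_dim + act_dim, latent_dim),  # m_phi(z_{t+1} | z_t, a_t)
    "policy":       GaussianHead(latent_dim, act_dim),               # pi_phi(a_t | z_t)
    "q_function":   mlp(latent_dim + act_dim, 1),                    # Q_theta(z_t, a_t)
    "reward_pred":  mlp(latent_dim + act_dim, 1),                    # r_theta(z_t, a_t)
}
targets = copy.deepcopy(nets)   # target networks, updated by Polyak averaging
```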

Updates from Replay Buffer (Real Transitions):

Sequences of $K$ steps, $(s_i, a_i, r_i, s_{i+1})_{i=t}^{t+K-1}$, are sampled from a replay buffer. These sequences are used to update the encoder, latent model, Q-function, and reward predictor.

  1. Encoder and Latent Model Update: These components are updated by maximizing the expected augmented-reward terms within $L^K_\phi$ using the sampled real transitions. Specifically, they maximize

     $$E \left[ \sum_{i=t}^{t+K-1} \gamma^{i-t} \big(\log r_\theta(z_i, a_i) + \log e_{\text{targ}}(z_{i+1} | s_{i+1}) - \log m_\phi(z_{i+1} | z_i, a_i)\big) \right]$$

     where $z_i = e_\phi(s_i)$, $z_{i+1} = e_\phi(s_{i+1})$, and $e_{\text{targ}}$ is a target encoder. Note the use of the learned reward predictor $r_\theta$ and the target encoder for stability in practice. The log transform on the reward is omitted in the implementation and compensated by scaling the consistency term.

  2. Q-Function Update: Updated using standard TD learning on K-step targets computed from real transitions and the reward predictor. Target: $y_t = \sum_{i=t}^{t+K-1} \gamma^{i-t} r_\theta(z_i, a_i) + \gamma^K Q_{\text{targ}}(z_{t+K}, \pi_\phi(z_{t+K}))$. Loss: $E[(Q_\theta(z_t, a_t) - y_t)^2]$.
  3. Reward Predictor Update: Trained via MSE loss on real transitions. Loss: $E[(r_\theta(z_t, a_t) - r_t)^2]$ (a combined sketch of these three updates follows this list).
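
Below is a minimal sketch of these three updates under the assumptions of the earlier component sketch (diagonal-Gaussian encoder and model, MLP critic and reward head). Returning a single summed loss is a simplification; in a full implementation, separate optimizers would keep $r_\theta$ trained only by its regression loss.

```python
import torch
import torch.nn.functional as F

def model_and_critic_update(batch, nets, targets, gamma=0.99, K=3):
    """One update from K-step real sequences: encoder/latent model, Q-function,
    and reward predictor, following items 1-3 above (sketch, not the paper's code)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]  # shapes [B, K, ...]

    # --- Item 1: encoder + latent model maximize predicted reward + consistency ---
    seq_loss = 0.0
    z = nets["encoder"](s[:, 0]).rsample()                     # z_t = e_phi(s_t)
    for i in range(K):
        m_dist = nets["latent_model"](z, a[:, i])              # m_phi(. | z_i, a_i)
        z_next = nets["encoder"](s_next[:, i]).rsample()       # z_{i+1} = e_phi(s_{i+1})
        r_hat = nets["reward_pred"](torch.cat([z, a[:, i]], -1)).squeeze(-1)
        consistency = (targets["encoder"](s_next[:, i]).log_prob(z_next).sum(-1)
                       - m_dist.log_prob(z_next).sum(-1))
        # log transform on r_hat omitted, as in the implementation described above
        seq_loss = seq_loss - (gamma ** i) * (r_hat + consistency).mean()
        z = z_next

    # --- Item 2: K-step TD target for the Q-function ---
    with torch.no_grad():
        zs = [nets["encoder"](s[:, i]).sample() for i in range(K)]
        z_K = nets["encoder"](s_next[:, K - 1]).sample()
        a_K = nets["policy"](z_K).sample()
        y = sum((gamma ** i) * nets["reward_pred"](torch.cat([zs[i], a[:, i]], -1)).squeeze(-1)
                for i in range(K))
        y = y + (gamma ** K) * targets["q_function"](torch.cat([z_K, a_K], -1)).squeeze(-1)
    q = nets["q_function"](torch.cat([zs[0], a[:, 0]], -1)).squeeze(-1)
    q_loss = F.mse_loss(q, y)

    # --- Item 3: reward predictor regression onto observed rewards ---
    r_pred = nets["reward_pred"](torch.cat([zs[0], a[:, 0]], -1)).squeeze(-1)
    r_loss = F.mse_loss(r_pred, r[:, 0])

    return seq_loss + q_loss + r_loss
```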

Policy Update (Imagined Rollouts):

The policy $\pi_\phi$ is updated to maximize the objective $L^K_\phi$ using K-step trajectories imagined entirely within the latent space. Starting from an initial latent state $z_t$ (encoded from a sampled $s_t$), the agent rolls out actions $a_i \sim \pi_\phi(\cdot | z_i)$ and transitions $\hat{z}_{i+1} \sim m_\phi(\cdot | z_i, a_i)$ for $K$ steps.

A key challenge arises here: during imagined rollouts, the true next state $s_{i+1}$ is unavailable, making it impossible to compute the encoder term $\log e_\phi(z_{i+1} | s_{i+1})$ needed for the intrinsic consistency reward. To address this, ALM approximates the log-likelihood difference $\log e_\phi(z_{t+1} | s_{t+1}) - \log m_\phi(z_{t+1} | z_t, a_t)$ using a learned binary classifier $C_\psi(z_{t+1}, a_t, z_t)$. This classifier is trained on data from the replay buffer to distinguish "real" next latent states $z_{t+1} \sim e_\phi(\cdot | s_{t+1})$ (labeled 1) from "model-predicted" latent states $\hat{z}_{t+1} \sim m_\phi(\cdot | z_t, a_t)$ (labeled 0). The log-odds of the classifier approximate the desired log-likelihood ratio: $\log \frac{p_{\text{real}}(z_{t+1} | z_t, a_t)}{p_{\text{model}}(z_{t+1} | z_t, a_t)} \approx \text{logit}\big(C_\psi(z_{t+1}, a_t, z_t)\big)$.
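
A minimal sketch of how such a classifier could be trained and queried, assuming the component networks from earlier plus a hypothetical `classifier` MLP (e.g. `mlp(2 * latent_dim + act_dim, 1)`) that outputs raw logits; this mirrors the density-ratio trick described above rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def classifier_loss(classifier, encoder, latent_model, s, a, s_next):
    """Train C_psi(z_{t+1}, a_t, z_t) to separate encoder latents (label 1)
    from model-predicted latents (label 0)."""
    with torch.no_grad():
        z = encoder(s).sample()                      # z_t ~ e_phi(. | s_t)
        z_real = encoder(s_next).sample()            # "real":  z_{t+1} ~ e_phi(. | s_{t+1})
        z_fake = latent_model(z, a).sample()         # "fake":  z_{t+1} ~ m_phi(. | z_t, a_t)
    logit_real = classifier(torch.cat([z_real, a, z], -1))
    logit_fake = classifier(torch.cat([z_fake, a, z], -1))
    return (F.binary_cross_entropy_with_logits(logit_real, torch.ones_like(logit_real))
            + F.binary_cross_entropy_with_logits(logit_fake, torch.zeros_like(logit_fake)))

def intrinsic_reward(classifier, z_hat_next, a, z):
    """The classifier's raw logit (log-odds) approximates
    log e_phi(z_{t+1} | s_{t+1}) - log m_phi(z_{t+1} | z_t, a_t)."""
    return classifier(torch.cat([z_hat_next, a, z], -1)).squeeze(-1)
```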

The policy is updated by maximizing the expected return over these imagined K-step rollouts, incorporating the classifier-based estimate of the intrinsic reward:

$$E \left[ \sum_{i=t}^{t+K-1} \gamma^{i-t} \big(r_\theta(z_i, a_i) + c \cdot \text{logit}(C_\psi(\hat{z}_{i+1}, a_i, z_i))\big) + \gamma^K Q_\theta(z_{t+K}, a_{t+K}) \right]$$

where $a_i = \pi_\phi(z_i)$, $\hat{z}_{i+1} = m_\phi(z_i, a_i)$, and $c$ is a hyperparameter scaling the intrinsic reward (set to 0.1 in the paper, balancing the modified extrinsic and intrinsic terms).
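
A sketch of this imagined-rollout policy objective under the same illustrative components; letting gradients flow through reparameterized policy and latent-model samples is one reasonable choice, not necessarily the paper's exact gradient estimator.

```python
import torch

def policy_objective(s, nets, classifier, K=3, gamma=0.99, c=0.1):
    """Expected K-step return of an imagined latent rollout, with the classifier
    logit standing in for the intrinsic consistency reward."""
    z = nets["encoder"](s).sample()                           # start from an encoded real state
    ret = 0.0
    for i in range(K):
        a = nets["policy"](z).rsample()                       # a_i ~ pi_phi(. | z_i)
        r_hat = nets["reward_pred"](torch.cat([z, a], -1)).squeeze(-1)
        z_next = nets["latent_model"](z, a).rsample()         # z_hat_{i+1} ~ m_phi(. | z_i, a_i)
        bonus = classifier(torch.cat([z_next, a, z], -1)).squeeze(-1)   # logit(C_psi)
        ret = ret + (gamma ** i) * (r_hat + c * bonus)
        z = z_next
    a_K = nets["policy"](z).rsample()
    ret = ret + (gamma ** K) * nets["q_function"](torch.cat([z, a_K], -1)).squeeze(-1)
    return ret.mean()                                         # maximize (negate for a loss)
```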

This process jointly trains all components. The encoder and model learn from real data to be self-consistent and predict rewards, while the policy leverages the learned model and the consistency objective (via the classifier) to find high-reward trajectories where the model is reliable.

Theoretical and Empirical Contributions

The primary theoretical contribution is the derivation of $L^K_\phi$ as a lower bound on the log expected return for MBRL with latent variables. Prior theoretical work on MBRL often provided bounds related to model accuracy or exploration guarantees, but $L^K_\phi$ directly bounds the overall RL objective. This provides a principled foundation for simultaneously optimizing the representation, model, and policy towards the ultimate goal of maximizing returns.

Empirically, ALM demonstrates strong performance on continuous control benchmarks, particularly in terms of sample efficiency.

  • Sample Efficiency: On the MBRL benchmark from Wang et al. (2019), ALM generally outperforms prior methods like SAC-SVG, SLBO, TD3, and SAC at 2e5 environment steps. On standard MuJoCo tasks, ALM achieves sample efficiency comparable to state-of-the-art ensemble-based methods like MBPO and REDQ, reaching near-optimal performance significantly faster than purely model-free methods like SAC.
  • Computational Efficiency: A significant practical advantage of ALM is its computational efficiency. Unlike MBPO and REDQ, which rely on computationally expensive model ensembles to mitigate model errors, ALM achieves high performance using only a single latent dynamics model. This leads to substantially faster updates (~10x faster than MBPO, ~6x faster than REDQ according to the paper) and reduced wall-clock training time. The paper reports achieving performance comparable to SAC in approximately 50% less wall-clock time.
  • Addressing Objective Mismatch: The unified objective inherently encourages alignment. The encoder learns representations that are both predictable by the model (due to the intrinsic term) and useful for predicting rewards/value (needed for policy optimization). The model is trained to be accurate specifically on the representations produced by the encoder along policy-relevant trajectories. The policy is incentivized, via the intrinsic reward, to explore regions where the latent model's predictions align with the encoder's output on real data, promoting self-consistency and avoiding exploitation of model inaccuracies.

Conclusion

The work introduces a variational lower bound on the expected return that serves as a unified objective for jointly learning representations, latent dynamics models, and policies in MBRL. The resulting algorithm, ALM, leverages this objective to achieve high sample efficiency comparable to ensemble-based methods, but with significantly improved computational efficiency due to its single-model architecture. By optimizing all components towards a common goal derived directly from the RL objective, ALM provides a principled approach to mitigate the objective mismatch problem inherent in many prior MBRL methods that rely on auxiliary losses for representation and model learning.