
Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective (2209.08466v3)

Published 18 Sep 2022 in cs.LG, cs.AI, and cs.RO

Abstract: While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representation of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL on policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods. While sample efficient methods typically are computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.

References (87)
  1. Policy-aware model learning for policy gradient methods, 2020. URL https://arxiv.org/abs/2003.00030.
  2. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, 2021. ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2021.05.008. URL https://www.sciencedirect.com/science/article/pii/S1566253521001081.
  3. Maximum a posteriori policy optimisation, 2018. URL https://arxiv.org/abs/1806.06920.
  4. Differentiable mpc for end-to-end planning and control, 2018. URL https://arxiv.org/abs/1810.13400.
  5. On the model-based stochastic value gradient for continuous reinforcement learning, 2020. URL https://arxiv.org/abs/2008.12775.
  6. Deciding what to model: Value-equivalent sampling for reinforcement learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=fORXbIlTELP.
  7. Hagai Attias. Planning by probabilistic inference. In Christopher M. Bishop and Brendan J. Frey (eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, volume R4 of Proceedings of Machine Learning Research, pp.  9–16. PMLR, 03–06 Jan 2003. URL https://proceedings.mlr.press/r4/attias03a.html. Reissued by PMLR on 01 April 2021.
  8. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450.
  9. Information prioritization through empowerment in visual model-based rl. In International Conference on Learning Representations, 2021.
  10. Planning as inference. Trends in Cognitive Sciences, 16(10):485–488, 2012. ISSN 1364-6613. doi: https://doi.org/10.1016/j.tics.2012.08.006. URL https://www.sciencedirect.com/science/article/pii/S1364661312001957.
  11. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/f02208a057804ee16ac72ff4d3cec53b-Paper.pdf.
  12. Learning and querying fast generative models for reinforcement learning, 2018. URL https://arxiv.org/abs/1802.03006.
  13. Randomized ensembled double q-learning: Learning fast without a model, 2021. URL https://arxiv.org/abs/2101.05982.
  14. Deep reinforcement learning in a handful of trials using probabilistic dynamics models, 2018. URL https://arxiv.org/abs/1805.12114.
  15. Model-augmented actor-critic: Backpropagating through paths, 2020. URL https://arxiv.org/abs/2005.08068.
  16. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML), 2011.
  17. Dreamerpro: Reconstruction-free model-based reinforcement learning with prototypical representations, 2021. URL https://arxiv.org/abs/2110.14565.
  18. Gradient-aware model-based policy search. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3801–3808, apr 2020. doi: 10.1609/aaai.v34i04.5791. URL https://doi.org/10.1609%2Faaai.v34i04.5791.
  19. Provable rl with exogenous distractors via multistep inverse dynamics. arXiv preprint arXiv:2110.08847, 2021.
  20. Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers. arXiv e-prints, art. arXiv:2006.13916, June 2020.
  21. Mismatched no more: Joint model-policy optimization for model-based rl, 2021a. URL https://arxiv.org/abs/2110.02758.
  22. Robust predictable control, 2021b. URL https://arxiv.org/abs/2109.03214.
  23. Amir-massoud Farahmand. Iterative value-aware model learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/7a2347d96752880e3d58d72e9813cc14-Paper.pdf.
  24. Value-Aware Loss Function for Model-based Reinforcement Learning. In Aarti Singh and Jerry Zhu (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1486–1494. PMLR, 20–22 Apr 2017. URL https://proceedings.mlr.press/v54/farahmand17a.html.
  25. Model-based value estimation for efficient model-free reinforcement learning, 2018. URL https://arxiv.org/abs/1803.00101.
  26. Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  1587–1596. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/fujimoto18a.html.
  27. Model predictive control: Theory and practice - a survey. Autom., 25(3):335–348, 1989. URL http://dblp.uni-trier.de/db/journals/automatica/automatica25.html#GarciaPM89.
  28. Loss surfaces, mode connectivity, and fast ensembling of dnns, 2018. URL https://arxiv.org/abs/1802.10026.
  29. Reinforcement learning with competitive ensembles of information-constrained primitives, 2019. URL https://arxiv.org/abs/1906.10667.
  30. Bootstrap your own latent: A new approach to self-supervised learning, 2020. URL https://arxiv.org/abs/2006.07733.
  31. The value equivalence principle for model-based reinforcement learning, 2020. URL https://arxiv.org/abs/2011.03506.
  32. Learning invariant feature spaces to transfer skills with reinforcement learning, 2017. URL https://arxiv.org/abs/1703.02949.
  33. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf.
  34. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018. URL https://arxiv.org/abs/1801.01290.
  35. Learning latent dynamics for planning from pixels, 2018. URL https://arxiv.org/abs/1811.04551.
  36. Dream to control: Learning behaviors by latent imagination, 2019. URL https://arxiv.org/abs/1912.01603.
  37. Mastering atari with discrete world models, 2020. URL https://arxiv.org/abs/2010.02193.
  38. On the role of planning in model-based deep reinforcement learning, 2020. URL https://arxiv.org/abs/2011.04021.
  39. Temporal difference learning for model predictive control. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  8387–8406. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/hansen22a.html.
  40. Learning continuous control policies by stochastic value gradients, 2015. URL https://arxiv.org/abs/1510.09142.
  41. Hallucinating value: A pitfall of dyna-style planning with imperfect environment models, 2020. URL https://arxiv.org/abs/2006.04363.
  42. When to trust your model: Model-based policy optimization, 2019. URL https://arxiv.org/abs/1906.08253.
  43. Gamma-models: Generative temporal difference learning for infinite-horizon prediction. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1724–1735. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/12ffb0968f2f56e51a59a6beb37b2859-Paper.pdf.
  44. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  45. Reinforcement learning with misspecified model classes. In 2013 IEEE International Conference on Robotics and Automation, pp.  939–946, 2013. doi: 10.1109/ICRA.2013.6630686.
  46. Model-based reinforcement learning for atari, 2019. URL https://arxiv.org/abs/1903.00374.
  47. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
  48. Model-ensemble trust-region policy optimization, 2018. URL https://arxiv.org/abs/1802.10592.
  49. Objective mismatch in model-based reinforcement learning. ArXiv, abs/2002.04523, 2020.
  50. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020.
  51. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020. URL https://arxiv.org/abs/2005.01643.
  52. Continuous control with deep reinforcement learning, 2015. URL https://arxiv.org/abs/1509.02971.
  53. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees, 2018. URL https://arxiv.org/abs/1807.03858.
  54. Gradients are not all you need, 2021. URL https://arxiv.org/abs/2111.05803.
  55. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pp. 6961–6971. PMLR, 2020.
  56. Model predictive actor-critic: Accelerating robot skill acquisition with deep reinforcement learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, may 2021. doi: 10.1109/icra48506.2021.9561298. URL https://doi.org/10.1109%2Ficra48506.2021.9561298.
  57. Temporal predictive coding for model-based planning in latent space. In International Conference on Machine Learning, pp. 8130–8139. PMLR, 2021.
  58. Control-oriented model-based reinforcement learning with implicit differentiation, 2021. URL https://arxiv.org/abs/2106.03273.
  59. Action-conditional video prediction using deep networks in atari games. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pp.  2863–2871, Cambridge, MA, USA, 2015. MIT Press.
  60. Value prediction network, 2017. URL https://arxiv.org/abs/1707.03497.
  61. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction, 2020. URL https://arxiv.org/abs/2007.14535.
  62. Path integral networks: End-to-end differentiable optimal control, 2017. URL https://arxiv.org/abs/1706.09597.
  63. Pipps: Flexible model-based policy search robust to the curse of chaos, 2019. URL https://arxiv.org/abs/1902.01240.
  64. Relative entropy policy search. In AAAI, 2010.
  65. A survey on offline reinforcement learning: Taxonomy, review, and open problems, 2022. URL https://arxiv.org/abs/2203.01387.
  66. Imagination-augmented agents for deep reinforcement learning. ArXiv, abs/1707.06203, 2017.
  67. A game theoretic framework for model based reinforcement learning, 2020. URL https://arxiv.org/abs/2004.07804.
  68. Which mutual-information representation learning objectives are sufficient for control? Advances in Neural Information Processing Systems, 34:26345–26357, 2021.
  69. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, dec 2020. doi: 10.1038/s41586-020-03051-4. URL https://doi.org/10.1038%2Fs41586-020-03051-4.
  70. High-dimensional continuous control using generalized advantage estimation, 2015. URL https://arxiv.org/abs/1506.02438.
  71. Model-based policy optimization with unsupervised model adaptation, 2020. URL https://arxiv.org/abs/2010.09546.
  72. Learning off-policy with online planning, 2020. URL https://arxiv.org/abs/2008.10066.
  73. Local search for policy iteration in continuous control, 2020. URL https://arxiv.org/abs/2010.05545.
  74. Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2(4):160–163, 1991.
  75. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html.
  76. Value iteration networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/c21002f464c5fc5bee3b98ced83963b8-Paper.pdf.
  77. Russ Tedrake. Underactuated Robotics. Course Notes for MIT 6.832, 2022. URL http://underactuated.mit.edu.
  78. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, volume 6, pp.  1–9, 1993.
  79. Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp.  1049–1056, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553508. URL https://doi.org/10.1145/1553374.1553508.
  80. Value gradient weighted model-based reinforcement learning, 2022. URL https://arxiv.org/abs/2204.01464.
  81. Exploring model-based planning with policy networks, 2019. URL https://arxiv.org/abs/1906.08649.
  82. Benchmarking model-based reinforcement learning, 2019. URL https://arxiv.org/abs/1907.02057.
  83. How good is the bayes posterior in deep neural networks really?, 2020. URL https://arxiv.org/abs/2002.02405.
  84. Latent skill planning for exploration and transfer. In International Conference on Learning Representations, 2020.
  85. Mastering visual continuous control: Improved data-augmented reinforcement learning, 2021. URL https://arxiv.org/abs/2107.09645.
  86. Reward is enough for convex mdps, 2021. URL https://arxiv.org/abs/2106.00661.
  87. Learning invariant representations for reinforcement learning without reconstruction, 2020. URL https://arxiv.org/abs/2006.10742.
Authors (5)
  1. Raj Ghugare (4 papers)
  2. Homanga Bharadhwaj (36 papers)
  3. Benjamin Eysenbach (59 papers)
  4. Sergey Levine (531 papers)
  5. Ruslan Salakhutdinov (248 papers)
Citations (23)

Summary

The Unified Objective: Formulation and Derivation

Model-based reinforcement learning (MBRL) aims to improve sample efficiency over model-free methods by learning a dynamics model. However, in settings with high-dimensional observations (e.g., images), learning accurate world models directly on raw inputs is challenging, and prediction errors can compound. A common approach is to learn a low-dimensional representation $z_t$ of the observation $s_t$ using an encoder $e_\phi(z_t | s_t)$, learn a latent dynamics model $m_\phi(z_{t+1} | z_t, a_t)$, and train a policy $\pi_\phi(a_t | z_t)$ in this latent space. This decomposition often involves separate objectives: the encoder might use reconstruction or contrastive losses, the model typically uses maximum likelihood estimation (MLE), and the policy maximizes expected rewards. This separation can lead to an "objective mismatch," where components are optimized for goals not perfectly aligned with maximizing task returns. For instance, a representation that is optimal for reconstruction may not be optimal for control, and a model that is accurate under the MLE objective might perform poorly in states relevant to the policy.

The paper proposes a single objective function derived from a variational lower bound on the expected return, which jointly optimizes the encoder, latent model, and policy. The standard RL objective is to maximize the expected discounted return $E_{p(\tau)}[R(\tau)]$, where $p(\tau) = p_0(s_0) \prod_{t=0}^{H-1} p(s_{t+1} | s_t, a_t)\, \pi(a_t | s_t)$ is the trajectory distribution under the true environment dynamics $p(s_{t+1} | s_t, a_t)$ and the policy $\pi(a_t | s_t)$, and $R(\tau) = \sum_{t=0}^{H} \gamma^t r(s_t, a_t)$ is the discounted return. The key idea is to treat this as maximizing $E_{p(\tau)}[R(\tau)] = \int p(\tau) R(\tau)\, d\tau$ and apply Jensen's inequality to obtain a lower bound on the log expected return $\log E_{p(\tau)}[R(\tau)]$.

Introducing a proposal distribution $q(\tau)$ and using the standard Evidence Lower Bound (ELBO) derivation gives:

$$\log E_{p(\tau)}[R(\tau)] \ge E_{q(\tau)}\big[\log R(\tau) + \log p(\tau) - \log q(\tau)\big]$$
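
For readers who want the intermediate step, the bound follows from rewriting the expectation under $q$ and applying Jensen's inequality to the concave logarithm; this is the standard importance-weighted argument (it assumes $R(\tau) > 0$ so that the logarithm is defined):

$$\log E_{p(\tau)}[R(\tau)] = \log E_{q(\tau)}\!\left[\frac{p(\tau)\, R(\tau)}{q(\tau)}\right] \ge E_{q(\tau)}\!\left[\log \frac{p(\tau)\, R(\tau)}{q(\tau)}\right] = E_{q(\tau)}\big[\log R(\tau) + \log p(\tau) - \log q(\tau)\big]$$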

The crucial step is the design of the proposal distribution $q(\tau)$. Instead of modeling the true, high-dimensional state transitions $p(s_{t+1} | s_t, a_t)$, the paper defines a K-step latent-space proposal distribution $q_\phi^K(\tau)$ based on the learned encoder $e_\phi$, policy $\pi_\phi$, and latent dynamics model $m_\phi$:

$$q_{\phi}^K(\tau) = p_0(s_0)\, e_{\phi}(z_0 | s_0)\, \pi_{\phi}(a_0 | z_0) \prod_{t=1}^{K} p(s_t | s_{t-1}, a_{t-1})\, m_{\phi}(z_{t} | z_{t-1}, a_{t-1})\, \pi_{\phi}(a_t | z_t)$$

Note that this proposal samples the true next states $s_t$ from the environment up to step $K$, but generates the latent states $z_t$ for $t > 0$ using the learned latent model $m_\phi$.
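
To make the structure of $q_\phi^K$ concrete, here is a minimal sketch of sampling one proposal trajectory. It assumes hypothetical callables `encoder`, `latent_model`, and `policy` that return torch distributions, plus a Gymnasium-style `env`; none of these names come from the paper's released code.

```python
import torch

def sample_proposal_trajectory(env, encoder, latent_model, policy, K):
    """Sample one K-step trajectory from the proposal q_phi^K.

    Real states s_t come from the environment; latent states z_t for t > 0
    come from the learned latent model, mirroring the factorization of q_phi^K.
    """
    s, _ = env.reset()                                   # s_0 ~ p_0(s_0)
    s = torch.as_tensor(s, dtype=torch.float32)
    z = encoder(s).sample()                              # z_0 ~ e_phi(. | s_0)
    traj = []
    for t in range(K):
        a = policy(z).sample()                           # a_t ~ pi_phi(. | z_t)
        s_next, r, terminated, truncated, _ = env.step(a.numpy())
        s_next = torch.as_tensor(s_next, dtype=torch.float32)
        z_next = latent_model(z, a).sample()             # z_{t+1} ~ m_phi(. | z_t, a_t)
        traj.append((s, z, a, r, s_next, z_next))
        s, z = s_next, z_next
        if terminated or truncated:
            break
    return traj
```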

Substituting $q_\phi^K$ into the ELBO and simplifying yields the paper's central objective function (Theorem 3.1), a lower bound on the expected return:

$$L^K_\phi = E_{q_{\phi}^K(\tau)} \left[ \left( \sum_{t=0}^{K-1} \gamma^{t}\, \tilde{r}(s_t, a_t, s_{t+1}) \right) + \gamma^{K} \log Q(s_K, a_K) \right]$$

Here, $Q(s_K, a_K)$ represents the value beyond the K-step horizon, and $\tilde{r}$ is an augmented reward:

$$\tilde{r}(s_t, a_t, s_{t+1}) = \underbrace{(1-\gamma) \log r(s_t, a_t)}_{\text{(a) extrinsic reward term}} + \underbrace{\log e_{\phi}(z_{t+1} | s_{t+1}) - \log m_{\phi}(z_{t+1} | z_{t}, a_{t})}_{\text{(b) intrinsic consistency term}}$$
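
As an illustration, the augmented reward for a single transition could be computed as below, assuming `encoder` and `latent_model` return torch distributions and the environment reward `r` is a positive tensor so the logarithm is defined; the names are illustrative, not the paper's implementation.

```python
import torch

def augmented_reward(encoder, latent_model, r, s_next, z, a, z_next, gamma=0.99):
    """tilde_r(s_t, a_t, s_{t+1}) = (1 - gamma) * log r(s_t, a_t)
                                    + log e_phi(z_{t+1} | s_{t+1})
                                    - log m_phi(z_{t+1} | z_t, a_t)"""
    extrinsic = (1.0 - gamma) * torch.log(r)                        # (a) extrinsic term
    consistency = (encoder(s_next).log_prob(z_next).sum(-1)         # (b) intrinsic term
                   - latent_model(z, a).log_prob(z_next).sum(-1))
    return extrinsic + consistency
```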

The objective $L^K_\phi$ depends on the parameters $\phi$ shared by the encoder $e_\phi$, the latent model $m_\phi$, and the policy $\pi_\phi$. Maximizing $L^K_\phi$ simultaneously optimizes all three components:

  1. The extrinsic term (a) encourages maximizing the standard environment rewards (scaled and log-transformed).
  2. The intrinsic term (b) encourages consistency between the latent model's prediction $m_\phi(z_{t+1} | z_t, a_t)$ and the representation of the actual next state encoded by $e_\phi(z_{t+1} | s_{t+1})$. This term can be interpreted as minimizing the KL divergence $D_{KL}\big(e_\phi(\cdot | s_{t+1})\,\|\,m_\phi(\cdot | z_t, a_t)\big)$ under the distribution generated by $q_\phi^K$. It incentivizes the model to predict representations accurately and the encoder to produce representations that are predictable by the model (a closed-form Gaussian sketch of this interpretation follows the list).
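
When the encoder and latent model output diagonal Gaussians, the KL in item 2 has a closed form. The sketch below follows the interpretation stated above; the distribution-returning `encoder` and `latent_model` are the same illustrative assumptions as in the earlier snippets.

```python
import torch
from torch.distributions import Independent, kl_divergence

def consistency_term(encoder, latent_model, s_next, z, a):
    """Negative D_KL(e_phi(. | s_{t+1}) || m_phi(. | z_t, a_t)) for diagonal Gaussians;
    maximizing the intrinsic term corresponds to shrinking this divergence."""
    e_dist = Independent(encoder(s_next), 1)       # e_phi(. | s_{t+1})
    m_dist = Independent(latent_model(z, a), 1)    # m_phi(. | z_t, a_t)
    return -kl_divergence(e_dist, m_dist)          # larger value = more self-consistent
```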

This formulation directly links representation learning, model learning, and policy optimization through a single objective derived from the fundamental RL goal, thus mitigating the objective mismatch problem.

The Aligned Latent Models (ALM) Algorithm

The practical algorithm, Aligned Latent Models (ALM), implements the optimization of the $L^K_\phi$ objective within an actor-critic framework, specifically building upon DDPG. It maintains an encoder $e_\phi$, a latent dynamics model $m_\phi$, a policy $\pi_\phi$, a Q-function $Q_\theta(z_t, a_t)$, and a reward predictor $r_\theta(z_t, a_t)$. Target networks are used for stability.
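
A minimal sketch of the component networks this description implies, written with PyTorch. The widths, Gaussian heads, and latent size are illustrative assumptions (the paper's actual architecture and hyperparameters may differ), and the stochastic policy head is a simplification of the DDPG-style actor.

```python
import copy
import torch
import torch.nn as nn
from torch.distributions import Normal

def mlp(inp, out, hidden=256):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ELU(),
                         nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, out))

class GaussianHead(nn.Module):
    """Maps (concatenated) inputs to a diagonal Gaussian; used here for e_phi, m_phi, pi_phi."""
    def __init__(self, inp, out):
        super().__init__()
        self.net = mlp(inp, 2 * out)

    def forward(self, *xs):
        mean, log_std = self.net(torch.cat(xs, dim=-1)).chunk(2, dim=-1)
        return Normal(mean, log_std.clamp(-5, 2).exp())

obs_dim, act_dim, latent_dim = 17, 6, 50                             # illustrative sizes
nets = {
    "encoder":      GaussianHead(obs_dim, latent_dim),               # e_phi(z_t | s_t)
    "latent_model": GaussianHead(latent_dim + act_dim, latent_dim),  # m_phi(z_{t+1} | z_t, a_t)
    "policy":       GaussianHead(latent_dim, act_dim),               # pi_phi(a_t | z_t)
    "q_function":   mlp(latent_dim + act_dim, 1),                    # Q_theta(z_t, a_t)
    "reward_pred":  mlp(latent_dim + act_dim, 1),                    # r_theta(z_t, a_t)
}
targets = copy.deepcopy(nets)   # target networks, updated by Polyak averaging
```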

Updates from Replay Buffer (Real Transitions):

Sequences of $K$ steps, $(s_i, a_i, r_i, s_{i+1})_{i=t}^{t+K-1}$, are sampled from a replay buffer. These sequences are used to update the encoder, latent model, Q-function, and reward predictor.

  1. Encoder and Latent Model Update: These components are updated by maximizing the expected augmented-reward terms within $L^K_\phi$ using the sampled real transitions. Specifically, they maximize

     $$E \left[ \sum_{i=t}^{t+K-1} \gamma^{i-t} \big(\log r_\theta(z_i, a_i) + \log e_{\text{targ}}(z_{i+1} | s_{i+1}) - \log m_\phi(z_{i+1} | z_i, a_i)\big) \right]$$

     where $z_i = e_\phi(s_i)$, $z_{i+1} = e_\phi(s_{i+1})$, and $e_{\text{targ}}$ is a target encoder. Note the use of the learned reward predictor $r_\theta$ and the target encoder for stability in practice. The log transform on the reward is omitted in the implementation and compensated by scaling the consistency term.

  2. Q-Function Update: Updated using standard TD learning on K-step targets computed from real transitions and the reward predictor. Target: $y_t = \sum_{i=t}^{t+K-1} \gamma^{i-t} r_\theta(z_i, a_i) + \gamma^K Q_{\text{targ}}(z_{t+K}, \pi_\phi(z_{t+K}))$. Loss: $E[(Q_\theta(z_t, a_t) - y_t)^2]$.
  3. Reward Predictor Update: Trained via MSE loss on real transitions. Loss: $E[(r_\theta(z_t, a_t) - r_t)^2]$ (a combined sketch of these three updates follows this list).
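
Below is a minimal sketch of these three updates under the assumptions of the earlier component sketch (diagonal-Gaussian encoder and model, MLP critic and reward head). Returning a single summed loss is a simplification; in a full implementation, separate optimizers would keep $r_\theta$ trained only by its regression loss.

```python
import torch
import torch.nn.functional as F

def model_and_critic_update(batch, nets, targets, gamma=0.99, K=3):
    """One update from K-step real sequences: encoder/latent model, Q-function,
    and reward predictor, following items 1-3 above (sketch, not the paper's code)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]  # shapes [B, K, ...]

    # --- Item 1: encoder + latent model maximize predicted reward + consistency ---
    seq_loss = 0.0
    z = nets["encoder"](s[:, 0]).rsample()                     # z_t = e_phi(s_t)
    for i in range(K):
        m_dist = nets["latent_model"](z, a[:, i])              # m_phi(. | z_i, a_i)
        z_next = nets["encoder"](s_next[:, i]).rsample()       # z_{i+1} = e_phi(s_{i+1})
        r_hat = nets["reward_pred"](torch.cat([z, a[:, i]], -1)).squeeze(-1)
        consistency = (targets["encoder"](s_next[:, i]).log_prob(z_next).sum(-1)
                       - m_dist.log_prob(z_next).sum(-1))
        # log transform on r_hat omitted, as in the implementation described above
        seq_loss = seq_loss - (gamma ** i) * (r_hat + consistency).mean()
        z = z_next

    # --- Item 2: K-step TD target for the Q-function ---
    with torch.no_grad():
        zs = [nets["encoder"](s[:, i]).sample() for i in range(K)]
        z_K = nets["encoder"](s_next[:, K - 1]).sample()
        a_K = nets["policy"](z_K).sample()
        y = sum((gamma ** i) * nets["reward_pred"](torch.cat([zs[i], a[:, i]], -1)).squeeze(-1)
                for i in range(K))
        y = y + (gamma ** K) * targets["q_function"](torch.cat([z_K, a_K], -1)).squeeze(-1)
    q = nets["q_function"](torch.cat([zs[0], a[:, 0]], -1)).squeeze(-1)
    q_loss = F.mse_loss(q, y)

    # --- Item 3: reward predictor regression onto observed rewards ---
    r_pred = nets["reward_pred"](torch.cat([zs[0], a[:, 0]], -1)).squeeze(-1)
    r_loss = F.mse_loss(r_pred, r[:, 0])

    return seq_loss + q_loss + r_loss
```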

Policy Update (Imagined Rollouts):

The policy $\pi_\phi$ is updated to maximize the objective $L^K_\phi$ using K-step trajectories imagined entirely within the latent space. Starting from an initial latent state $z_t$ (encoded from a sampled $s_t$), the agent rolls out actions $a_i \sim \pi_\phi(\cdot | z_i)$ and transitions $\hat{z}_{i+1} \sim m_\phi(\cdot | z_i, a_i)$ for $K$ steps.

A key challenge arises here: during imagined rollouts, the true next state $s_{i+1}$ is unavailable, making it impossible to compute the encoder term $\log e_\phi(z_{i+1} | s_{i+1})$ needed for the intrinsic consistency reward. To address this, ALM approximates the log-likelihood difference $\log e_\phi(z_{t+1} | s_{t+1}) - \log m_\phi(z_{t+1} | z_t, a_t)$ using a learned binary classifier $C_\psi(z_{t+1}, a_t, z_t)$. This classifier is trained on data from the replay buffer to distinguish "real" next latent states $z_{t+1} \sim e_\phi(\cdot | s_{t+1})$ (labeled 1) from "model-predicted" latent states $\hat{z}_{t+1} \sim m_\phi(\cdot | z_t, a_t)$ (labeled 0). The log-odds of the classifier approximate the desired log-likelihood ratio: $\log \frac{p_{\text{real}}(z_{t+1} | z_t, a_t)}{p_{\text{model}}(z_{t+1} | z_t, a_t)} \approx \text{logit}\big(C_\psi(z_{t+1}, a_t, z_t)\big)$.
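
A minimal sketch of how such a classifier could be trained and queried, assuming the component networks from earlier plus a hypothetical `classifier` MLP (e.g. `mlp(2 * latent_dim + act_dim, 1)`) that outputs raw logits; this mirrors the density-ratio trick described above rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def classifier_loss(classifier, encoder, latent_model, s, a, s_next):
    """Train C_psi(z_{t+1}, a_t, z_t) to separate encoder latents (label 1)
    from model-predicted latents (label 0)."""
    with torch.no_grad():
        z = encoder(s).sample()                      # z_t ~ e_phi(. | s_t)
        z_real = encoder(s_next).sample()            # "real":  z_{t+1} ~ e_phi(. | s_{t+1})
        z_fake = latent_model(z, a).sample()         # "fake":  z_{t+1} ~ m_phi(. | z_t, a_t)
    logit_real = classifier(torch.cat([z_real, a, z], -1))
    logit_fake = classifier(torch.cat([z_fake, a, z], -1))
    return (F.binary_cross_entropy_with_logits(logit_real, torch.ones_like(logit_real))
            + F.binary_cross_entropy_with_logits(logit_fake, torch.zeros_like(logit_fake)))

def intrinsic_reward(classifier, z_hat_next, a, z):
    """The classifier's raw logit (log-odds) approximates
    log e_phi(z_{t+1} | s_{t+1}) - log m_phi(z_{t+1} | z_t, a_t)."""
    return classifier(torch.cat([z_hat_next, a, z], -1)).squeeze(-1)
```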

The policy is updated by maximizing the expected return over these imagined K-step rollouts, incorporating the classifier-based estimate of the intrinsic reward:

$$E \left[ \sum_{i=t}^{t+K-1} \gamma^{i-t} \big(r_\theta(z_i, a_i) + c \cdot \text{logit}(C_\psi(\hat{z}_{i+1}, a_i, z_i))\big) + \gamma^K Q_\theta(z_{t+K}, a_{t+K}) \right]$$

where $a_i = \pi_\phi(z_i)$, $\hat{z}_{i+1} = m_\phi(z_i, a_i)$, and $c$ is a hyperparameter scaling the intrinsic reward (set to 0.1 in the paper, balancing the modified extrinsic and intrinsic terms).
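
A sketch of this imagined-rollout policy objective under the same illustrative components; letting gradients flow through reparameterized policy and latent-model samples is one reasonable choice, not necessarily the paper's exact gradient estimator.

```python
import torch

def policy_objective(s, nets, classifier, K=3, gamma=0.99, c=0.1):
    """Expected K-step return of an imagined latent rollout, with the classifier
    logit standing in for the intrinsic consistency reward."""
    z = nets["encoder"](s).sample()                           # start from an encoded real state
    ret = 0.0
    for i in range(K):
        a = nets["policy"](z).rsample()                       # a_i ~ pi_phi(. | z_i)
        r_hat = nets["reward_pred"](torch.cat([z, a], -1)).squeeze(-1)
        z_next = nets["latent_model"](z, a).rsample()         # z_hat_{i+1} ~ m_phi(. | z_i, a_i)
        bonus = classifier(torch.cat([z_next, a, z], -1)).squeeze(-1)   # logit(C_psi)
        ret = ret + (gamma ** i) * (r_hat + c * bonus)
        z = z_next
    a_K = nets["policy"](z).rsample()
    ret = ret + (gamma ** K) * nets["q_function"](torch.cat([z, a_K], -1)).squeeze(-1)
    return ret.mean()                                         # maximize (negate for a loss)
```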

This process jointly trains all components. The encoder and model learn from real data to be self-consistent and predict rewards, while the policy leverages the learned model and the consistency objective (via the classifier) to find high-reward trajectories where the model is reliable.

Theoretical and Empirical Contributions

The primary theoretical contribution is the derivation of $L^K_\phi$ as a lower bound on the log expected return for MBRL with latent variables. Prior theoretical work on MBRL often provided bounds related to model accuracy or exploration guarantees, but $L^K_\phi$ directly bounds the overall RL objective. This provides a principled foundation for simultaneously optimizing the representation, model, and policy towards the ultimate goal of maximizing returns.

Empirically, ALM demonstrates strong performance on continuous control benchmarks, particularly in terms of sample efficiency.

  • Sample Efficiency: On the MBRL benchmark from Wang et al. (2019), ALM generally outperforms prior methods like SAC-SVG, SLBO, TD3, and SAC at 2e5 environment steps. On standard MuJoCo tasks, ALM achieves sample efficiency comparable to state-of-the-art ensemble-based methods like MBPO and REDQ, reaching near-optimal performance significantly faster than purely model-free methods like SAC.
  • Computational Efficiency: A significant practical advantage of ALM is its computational efficiency. Unlike MBPO and REDQ, which rely on computationally expensive model ensembles to mitigate model errors, ALM achieves high performance using only a single latent dynamics model. This leads to substantially faster updates (~10x faster than MBPO, ~6x faster than REDQ according to the paper) and reduced wall-clock training time. The paper reports achieving performance comparable to SAC in approximately 50% less wall-clock time.
  • Addressing Objective Mismatch: The unified objective inherently encourages alignment. The encoder learns representations that are both predictable by the model (due to the intrinsic term) and useful for predicting rewards/value (needed for policy optimization). The model is trained to be accurate specifically on the representations produced by the encoder along policy-relevant trajectories. The policy is incentivized, via the intrinsic reward, to explore regions where the latent model's predictions align with the encoder's output on real data, promoting self-consistency and avoiding exploitation of model inaccuracies.

Conclusion

The work introduces a variational lower bound on the expected return that serves as a unified objective for jointly learning representations, latent dynamics models, and policies in MBRL. The resulting algorithm, ALM, leverages this objective to achieve high sample efficiency comparable to ensemble-based methods, but with significantly improved computational efficiency due to its single-model architecture. By optimizing all components towards a common goal derived directly from the RL objective, ALM provides a principled approach to mitigate the objective mismatch problem inherent in many prior MBRL methods that rely on auxiliary losses for representation and model learning.