Latent Plan Transformer

  • Latent Plan Transformer is a model that introduces a continuous latent plan variable to enable credit assignment and trajectory stitching without step‐wise rewards.
  • It employs a causal Transformer-based trajectory generator conditioned on latent plans, trained via maximum likelihood with MCMC sampling to achieve temporal consistency.
  • Empirical evaluations demonstrate that LPT outperforms baseline models on offline RL benchmarks by ensuring cohesive long-horizon planning and effective trajectory composition.

The Latent Plan Transformer (LPT) is a generative model for trajectory abstraction and planning, specifically designed to address offline reinforcement learning (RL) settings where only full-trajectory returns are available and step-wise reward signals are absent. LPT introduces a latent continuous variable, termed the "plan," which enables the enforcement of temporal consistency across entire episodes, credit assignment over long horizons, and compositional planning via trajectory stitching. Its distinguishing algorithmic feature is planning as latent space inference, realized by a Transformer-based trajectory generator conditioned on the plan variable, with learning and inference achieved via maximum likelihood estimation and Markov Chain Monte Carlo (MCMC) sampling in the latent space (Kong et al., 7 Feb 2024).

1. Problem Formulation and Motivation

LPT is motivated by the challenge of long-term planning with offline RL datasets $D = \{(\tau_i, y_i)\}$, where each $\tau = (s_1, a_1, \ldots, s_H, a_H)$ is a trajectory of state-action pairs and $y = \sum_{t=1}^H r(s_{\leq t}, a_{\leq t})$ is the total return. In this setting:

  • Credit assignment: Effective association of sparse/delayed rewards to temporally distant actions is difficult in the absence of step-wise reward signals.
  • Trajectory stitching: Construction of new, high-return trajectories from observed suboptimal fragments.
  • Temporal consistency: Mitigating policy drift in autoregressive models that operate on a finite context, conditioned only on past states and a single summary return.

LPT addresses these issues by introducing a latent "plan" variable $z$ that generates trajectories and predicts scalar returns, facilitating episode-level coherence and scalable planning as latent variable inference (Kong et al., 7 Feb 2024).
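
To make the setting concrete, the sketch below shows one way to represent such a dataset in Python; the field names and array shapes are illustrative placeholders, not taken from the paper or its released code.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One offline episode: states s_1..s_H, actions a_1..a_H, and the scalar
    episode return y. No per-step rewards are available in this setting."""
    states: np.ndarray    # shape (H, state_dim)
    actions: np.ndarray   # shape (H, action_dim)
    total_return: float   # y, the only reward signal attached to the episode

# D = {(tau_i, y_i)}: trajectories labeled only with their final returns.
dataset = [
    Trajectory(states=np.zeros((1000, 17)),   # placeholder shapes
               actions=np.zeros((1000, 6)),
               total_return=0.0),
]
```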

2. Probabilistic Model and Inference

LPT defines a joint generative model
$$p_\theta(\tau, y, z) = p_\alpha(z)\, p_\beta(\tau \mid z)\, p_\gamma(y \mid z), \qquad \theta = (\alpha, \beta, \gamma)$$
where

  • $p_\alpha(z)$: Plan prior. Implicit, with $z_0 \sim \mathcal{N}(0, I)$ mapped to $z = U_\alpha(z_0)$ via a neural network (U-Net or MLP).
  • $p_\beta(\tau \mid z)$: Trajectory generator. An autoregressive, causal Transformer operating over a finite context of length $K$; each token predicts $p_\beta(s_t, a_t \mid s_{t-K:t-1}, a_{t-K:t-1}, z)$.
  • $p_\gamma(y \mid z)$: Return predictor. Gaussian likelihood $p_\gamma(y \mid z) = \mathcal{N}(y;\, r_\gamma(z), \sigma^2)$, with $r_\gamma$ an MLP and $\sigma^2$ fixed (see the sketch after this list).
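
Tying the three factors together, a minimal sketch of the joint log-density in $z_0$-space (the form used later for Langevin sampling) might look as follows. The names `prior_net`, `traj_log_prob`, and `return_head` are assumed for illustration, not the paper's actual code, and additive normalizing constants are dropped.

```python
import torch

def joint_log_prob(z0, traj, y, model, sigma2=1.0):
    """log p(z0, tau, y) = log N(z0; 0, I)
                         + log p_beta(tau | U_alpha(z0))
                         + log p_gamma(y  | U_alpha(z0)),  up to additive constants.
    `model` is assumed to expose prior_net (U_alpha), traj_log_prob(traj, z)
    (the summed per-token log-likelihood of p_beta), and return_head (r_gamma)."""
    z = model.prior_net(z0)                     # z = U_alpha(z0)
    log_p0 = -0.5 * (z0 ** 2).sum(dim=-1)       # standard-normal prior on z0
    log_p_traj = model.traj_log_prob(traj, z)   # autoregressive trajectory likelihood
    y_hat = model.return_head(z).squeeze(-1)    # r_gamma(z)
    log_p_y = -0.5 * (y - y_hat) ** 2 / sigma2  # Gaussian return likelihood
    return log_p0 + log_p_traj + log_p_y
```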

The evidence lower bound (ELBO) for maximum likelihood training is
$$\mathcal{L}(\tau, y) = \mathbb{E}_{q(z \mid \tau, y)}\left[\log p_\beta(\tau \mid z) + \log p_\gamma(y \mid z)\right] - \mathrm{KL}\left[q(z \mid \tau, y) \,\Vert\, p_\alpha(z)\right]$$
with approximate posterior $q(z \mid \tau, y)$. If $q = p_\theta(z \mid \tau, y)$, the bound is tight. The marginal likelihood integrates out $z$:
$$\log p_\theta(\tau, y) = \log \int p_\alpha(z)\, p_\beta(\tau \mid z)\, p_\gamma(y \mid z)\, dz$$
(Kong et al., 7 Feb 2024).
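
When $q$ is the exact posterior $p_\theta(z \mid \tau, y)$, the learning gradient can equivalently be written without the KL term, via the standard identity for latent-variable maximum likelihood; this is what the MCMC-based training described in the next section exploits:
$$\nabla_\theta \log p_\theta(\tau, y) = \mathbb{E}_{p_\theta(z \mid \tau, y)}\left[\nabla_\theta \log p_\theta(\tau, y, z)\right]$$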

3. Model Architecture and Training

LPT's architecture comprises:

  • Plan prior: Samples $z_0 \sim \mathcal{N}(0, I)$ and maps it to $z = U_\alpha(z_0)$, where $U_\alpha$ is either a U-Net or an MLP, representing an implicit prior $p_\alpha(z)$.
  • Trajectory generator: A stack of $N$ Transformer blocks with causal self-attention over the past $K$ tokens and cross-attention from $z$ at each token position. At each timestep $t$, it outputs a Gaussian policy $a_t \sim \mathcal{N}(\mu_\beta(\cdot), I)$.
  • Return head: An MLP $r_\gamma(z)$ computing the mean of the Gaussian return predictor (a minimal sketch of these components follows this list).
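
The sketch below is an illustrative PyTorch reconstruction from the description above, not the authors' implementation: the token layout (each token encodes $(s_t, a_{t-1})$), the single cross-attended plan token, the omission of a state-prediction head, and all dimensions are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LPTBlock(nn.Module):
    """One generator block: causal self-attention over past tokens,
    then cross-attention that injects the latent plan z."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, z_tokens, causal_mask):
        h = self.ln1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask)[0]
        h = self.ln2(x)
        x = x + self.cross_attn(h, z_tokens, z_tokens)[0]   # attend to the plan
        return x + self.mlp(self.ln3(x))

class LatentPlanTransformer(nn.Module):
    """Plan prior U_alpha, trajectory generator p_beta, return head r_gamma."""
    def __init__(self, state_dim, action_dim, d_model=128, n_heads=4, n_blocks=3, z_dim=16):
        super().__init__()
        self.prior_net = nn.Sequential(nn.Linear(z_dim, 256), nn.GELU(),
                                       nn.Linear(256, z_dim))           # U_alpha (MLP variant)
        self.z_proj = nn.Linear(z_dim, d_model)
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        self.blocks = nn.ModuleList(LPTBlock(d_model, n_heads) for _ in range(n_blocks))
        self.action_mean = nn.Linear(d_model, action_dim)               # mu_beta
        self.return_head = nn.Sequential(nn.Linear(z_dim, 256), nn.GELU(),
                                         nn.Linear(256, 1))             # r_gamma

    def forward(self, states, actions, z):
        # states: (B, K, state_dim); actions: (B, K, action_dim); z: (B, z_dim)
        prev_actions = torch.cat([torch.zeros_like(actions[:, :1]), actions[:, :-1]], dim=1)
        x = self.embed(torch.cat([states, prev_actions], dim=-1))       # token t sees (s_t, a_{t-1})
        K = x.size(1)
        mask = torch.triu(torch.full((K, K), float("-inf"), device=x.device), diagonal=1)
        z_tokens = self.z_proj(z).unsqueeze(1)                          # (B, 1, d_model)
        for blk in self.blocks:
            x = blk(x, z_tokens, mask)
        return self.action_mean(x), self.return_head(z).squeeze(-1)     # mu_beta, r_gamma(z)
```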

Training algorithm:

  • LPT is optimized via offline maximum likelihood, leveraging (approximate) posterior sampling in the latent space.
  • For each training example, $p_\theta(z \mid \tau, y)$ is approximated via Langevin dynamics on $z_0$, with transitions

$$z_0^{k+1} = z_0^k + s\, \nabla_{z_0} \log p(z_0, \tau, y) + \sqrt{2s}\, \epsilon^k$$

where the gradient is taken through the joint log-probability of the trajectory and return.

  • Model parameters $(\alpha, \beta, \gamma)$ are updated by gradient ascent using empirical averages over the sampled $z_0$ (Kong et al., 7 Feb 2024); a sketch of this training loop follows.
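
A minimal sketch of this loop, reusing the hypothetical `joint_log_prob` from Section 2 and assuming short-run Langevin chains initialized from the prior; step counts and step sizes are placeholders, not the paper's settings.

```python
import torch

def langevin_posterior_sample(z0, traj, y, model, n_steps=15, step_size=0.1):
    """Short-run Langevin dynamics approximating p_theta(z0 | tau, y):
    z0 <- z0 + s * grad_z0 log p(z0, tau, y) + sqrt(2 s) * noise."""
    z0 = z0.detach().clone()
    for _ in range(n_steps):
        z0.requires_grad_(True)
        logp = joint_log_prob(z0, traj, y, model).sum()    # joint log-density sketch above
        grad = torch.autograd.grad(logp, z0)[0]
        with torch.no_grad():
            z0 = z0 + step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(z0)
    return z0.detach()

def training_step(traj, y, model, optimizer, z_dim=16):
    """One maximum-likelihood update: draw z0 from the approximate posterior,
    then ascend the joint log-likelihood in the parameters (alpha, beta, gamma)."""
    z0_init = torch.randn(y.shape[0], z_dim)               # or a persistent chain per example
    z0 = langevin_posterior_sample(z0_init, traj, y, model)
    loss = -joint_log_prob(z0, traj, y, model).mean()      # empirical average over sampled z0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```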

4. Planning as Latent Space Inference

At test time, LPT realizes planning as inference by conditioning on a desired return $y^*$. The plan $z^*$ is inferred as the mode of $p_\theta(z \mid y^*)$ via MCMC in the latent space:
$$z_0^{k+1} = z_0^k + s\, \nabla_{z_0}\left[\log p_0(z_0) + \log p_\gamma(y^* \mid U_\alpha(z_0))\right] + \sqrt{2s}\, \epsilon$$
After $N$ steps, set $z^* = U_\alpha(z_0^N)$. An episode is then generated by rolling out the autoregressive policy:
$$a_t \sim p_\beta(a_t \mid s_{t-K:t-1}, a_{t-K:t-1}, z^*), \qquad s_{t+1} \sim \mathrm{Env}(s_t, a_t)$$

This approach enables specification of arbitrary target returns, framing planning as finding a plan $z^*$ most compatible with the desired outcome in the learned latent space (Kong et al., 7 Feb 2024).
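
A sketch of this test-time inference, under the same assumptions about the model interface as above (hypothetical `prior_net` and `return_head` attributes); the target return, step size, and number of steps are placeholders.

```python
import torch

def infer_plan(y_star, model, z_dim=16, n_steps=100, step_size=0.05, sigma2=1.0):
    """Planning as inference: Langevin dynamics on z0 targeting the desired
    return y*, ascending log p0(z0) + log p_gamma(y* | U_alpha(z0))."""
    z0 = torch.randn(1, z_dim)
    for _ in range(n_steps):
        z0.requires_grad_(True)
        z = model.prior_net(z0)
        y_hat = model.return_head(z).squeeze(-1)
        logp = (-0.5 * (z0 ** 2).sum(dim=-1) - 0.5 * (y_star - y_hat) ** 2 / sigma2).sum()
        grad = torch.autograd.grad(logp, z0)[0]
        with torch.no_grad():
            z0 = z0 + step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(z0)
    with torch.no_grad():
        return model.prior_net(z0)   # z* = U_alpha(z0^N)

# Example usage (hypothetical numbers): infer a plan for a target return, then
# condition the autoregressive policy on it at every step of the rollout:
#   z_star = infer_plan(y_star=3600.0, model=model)
#   a_t ~ p_beta(a_t | s_{t-K:t-1}, a_{t-K:t-1}, z_star)
```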

5. Empirical Evaluation and Results

LPT is benchmarked on a range of environments:

| Domain | Tasks/Subsets | Characteristics |
| --- | --- | --- |
| Gym-Mujoco | HalfCheetah, Hopper, Walker2D (medium, replay), AntMaze (umaze, diverse) | Continuous control, dense/sparse reward |
| Maze2D | umaze, medium, large | Navigation, sparse reward |
| Connect Four | vs. stochastic opponent | Board game, adversarial |

Baselines: CQL, Decision Transformer (DT), Q-learning Decision Transformer (QDT), Online DT (ODT), ESPER.

Metrics: Average return $\pm$ standard deviation over 5 seeds.

Key statistical findings:

  • On Gym-Mujoco with only final return supervision, LPT outperforms DT and QDT, at times matching or exceeding CQL, which has access to step-wise rewards.
  • On Maze2D and AntMaze, LPT yields a $2\times$ to $5\times$ improvement over DT by stitching suboptimal trajectory fragments into near-optimal full trajectories.
  • On Connect Four, LPT achieves performance ($0.99 \pm 0.01$) matching SOTA ESPER ($0.99 \pm 0.03$), substantially outperforming DT ($0.8 \pm 0.07$) (Kong et al., 7 Feb 2024).

Qualitative insights:

  • Posterior gradients in $z$ integrate reward information from sub-trajectories, automating credit assignment.
  • t-SNE visualizations indicate the latent plan space allows interpolation between trajectories, capturing novel, high-return behaviors via trajectory stitching.
  • During execution, the latent $z$ is fixed, but the policy adapts to environment stochasticity, limiting overfitting to dataset contingencies.

6. Strengths, Limitations, and Future Prospects

Strengths:

  • Enforces temporal consistency over entire episodes without explicit step-wise reward or return-to-go conditioning.
  • Posterior sampling in latent space creates abstractions aggregating information across finite-context fragments.
  • Planning as inference (MCMC on $z$) allows for return-conditioned generation without relying on reward-to-go as input.
  • Demonstrates competitive or superior empirical results across dense, sparse, and adversarial benchmarks, supporting long-horizon credit assignment and trajectory composition (Kong et al., 7 Feb 2024).

Limitations and open questions:

  • MCMC-based latent sampling scales poorly for very long horizons. While persistent chains and fewer steps partially mitigate this, further advances such as amortized inference are desirable.
  • The implicit latent prior $p_\alpha(z)$ lacks explicit density modeling; replacing or augmenting it with normalizing flows or energy-based models may improve expressiveness.
  • Extending LPT to multi-task or hierarchical RL with discrete/discontinuous returns is an open direction.
  • Online continual fine-tuning currently yields limited gains; integrating the LPT posterior machinery with value-based pessimistic objectives remains an open research problem (Kong et al., 7 Feb 2024).