Latent Plan Transformer

  • Latent Plan Transformer is a model that introduces a continuous latent plan variable to enable credit assignment and trajectory stitching without step‐wise rewards.
  • It employs a causal Transformer-based trajectory generator conditioned on latent plans, trained via maximum likelihood with MCMC sampling to achieve temporal consistency.
  • Empirical evaluations demonstrate that LPT outperforms baseline models on offline RL benchmarks by ensuring cohesive long-horizon planning and effective trajectory composition.

The Latent Plan Transformer (LPT) is a generative model for trajectory abstraction and planning, specifically designed to address offline reinforcement learning (RL) settings where only full-trajectory returns are available and step-wise reward signals are absent. LPT introduces a latent continuous variable, termed the "plan," which enables the enforcement of temporal consistency across entire episodes, credit assignment over long horizons, and compositional planning via trajectory stitching. Its distinguishing algorithmic feature is planning as latent space inference, realized by a Transformer-based trajectory generator conditioned on the plan variable, with learning and inference achieved via maximum likelihood estimation and Markov Chain Monte Carlo (MCMC) sampling in the latent space (Kong et al., 7 Feb 2024).

1. Problem Formulation and Motivation

LPT is motivated by the challenge of long-term planning with offline RL datasets $D = \{(\tau_i, y_i)\}$, where each $\tau = (s_1, a_1, \ldots, s_H, a_H)$ is a trajectory of state-action pairs and $y = \sum_{t=1}^H r(s_{\leq t}, a_{\leq t})$ is the total return. In this setting:

  • Credit assignment: Effective association of sparse/delayed rewards to temporally distant actions is difficult in the absence of step-wise reward signals.
  • Trajectory stitching: Construction of new, high-return trajectories from observed suboptimal fragments.
  • Temporal consistency: Mitigating policy drift in autoregressive models that operate on a finite context, conditioned only on past states and a single summary return.

LPT addresses these issues by introducing a latent "plan" variable $z$ that generates trajectories and predicts scalar returns, facilitating episode-level coherence and scalable planning as latent variable inference (Kong et al., 7 Feb 2024).
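
To make the setting concrete, the sketch below shows one way to represent such a dataset in Python; the field names and array shapes are illustrative placeholders, not taken from the paper or its released code.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One offline episode: states s_1..s_H, actions a_1..a_H, and the scalar
    episode return y. No per-step rewards are available in this setting."""
    states: np.ndarray    # shape (H, state_dim)
    actions: np.ndarray   # shape (H, action_dim)
    total_return: float   # y, the only reward signal attached to the episode

# D = {(tau_i, y_i)}: trajectories labeled only with their final returns.
dataset = [
    Trajectory(states=np.zeros((1000, 17)),   # placeholder shapes
               actions=np.zeros((1000, 6)),
               total_return=0.0),
]
```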

2. Probabilistic Model and Inference

LPT defines a joint generative model
$$p_\theta(\tau, y, z) = p_\alpha(z)\, p_\beta(\tau \mid z)\, p_\gamma(y \mid z), \qquad \theta = (\alpha, \beta, \gamma)$$
where

  • $p_\alpha(z)$: Plan prior. Implicit, with $z_0 \sim \mathcal{N}(0, I)$ mapped to $z = U_\alpha(z_0)$ via a neural network (U-Net or MLP).
  • $p_\beta(\tau \mid z)$: Trajectory generator. An autoregressive, causal Transformer operating over a finite context of length $K$; each token predicts $p_\beta(s_t, a_t \mid s_{t-K:t-1}, a_{t-K:t-1}, z)$.
  • $p_\gamma(y \mid z)$: Return predictor. Gaussian likelihood $p_\gamma(y \mid z) = \mathcal{N}(y;\, r_\gamma(z), \sigma^2)$, with $r_\gamma$ an MLP and $\sigma^2$ fixed (see the sketch after this list).
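
Tying the three factors together, a minimal sketch of the joint log-density in $z_0$-space (the form used later for Langevin sampling) might look as follows. The names `prior_net`, `traj_log_prob`, and `return_head` are assumed for illustration, not the paper's actual code, and additive normalizing constants are dropped.

```python
import torch

def joint_log_prob(z0, traj, y, model, sigma2=1.0):
    """log p(z0, tau, y) = log N(z0; 0, I)
                         + log p_beta(tau | U_alpha(z0))
                         + log p_gamma(y  | U_alpha(z0)),  up to additive constants.
    `model` is assumed to expose prior_net (U_alpha), traj_log_prob(traj, z)
    (the summed per-token log-likelihood of p_beta), and return_head (r_gamma)."""
    z = model.prior_net(z0)                     # z = U_alpha(z0)
    log_p0 = -0.5 * (z0 ** 2).sum(dim=-1)       # standard-normal prior on z0
    log_p_traj = model.traj_log_prob(traj, z)   # autoregressive trajectory likelihood
    y_hat = model.return_head(z).squeeze(-1)    # r_gamma(z)
    log_p_y = -0.5 * (y - y_hat) ** 2 / sigma2  # Gaussian return likelihood
    return log_p0 + log_p_traj + log_p_y
```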

The evidence lower bound (ELBO) for maximum likelihood training is
$$\mathcal{L}(\tau, y) = \mathbb{E}_{q(z \mid \tau, y)}\left[\log p_\beta(\tau \mid z) + \log p_\gamma(y \mid z)\right] - \mathrm{KL}\left[q(z \mid \tau, y) \,\Vert\, p_\alpha(z)\right]$$
with approximate posterior $q(z \mid \tau, y)$. If $q = p_\theta(z \mid \tau, y)$, the bound is tight. The marginal likelihood integrates out $z$:
$$\log p_\theta(\tau, y) = \log \int p_\alpha(z)\, p_\beta(\tau \mid z)\, p_\gamma(y \mid z)\, dz$$
(Kong et al., 7 Feb 2024).
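
When $q$ is the exact posterior $p_\theta(z \mid \tau, y)$, the learning gradient can equivalently be written without the KL term, via the standard identity for latent-variable maximum likelihood; this is what the MCMC-based training described in the next section exploits:
$$\nabla_\theta \log p_\theta(\tau, y) = \mathbb{E}_{p_\theta(z \mid \tau, y)}\left[\nabla_\theta \log p_\theta(\tau, y, z)\right]$$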

3. Model Architecture and Training

LPT's architecture comprises:

  • Plan prior: Samples $z_0 \sim \mathcal{N}(0, I)$ and maps it to $z = U_\alpha(z_0)$, where $U_\alpha$ is either a U-Net or an MLP, representing an implicit prior $p_\alpha(z)$.
  • Trajectory generator: A stack of $N$ Transformer blocks with causal self-attention over the past $K$ tokens and cross-attention from $z$ at each token position. At each timestep $t$, it outputs a Gaussian policy $a_t \sim \mathcal{N}(\mu_\beta(\cdot), I)$.
  • Return head: An MLP $r_\gamma(z)$ computing the mean of the Gaussian return predictor (a minimal sketch of these components follows this list).
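
The sketch below is an illustrative PyTorch reconstruction from the description above, not the authors' implementation: the token layout (each token encodes $(s_t, a_{t-1})$), the single cross-attended plan token, the omission of a state-prediction head, and all dimensions are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LPTBlock(nn.Module):
    """One generator block: causal self-attention over past tokens,
    then cross-attention that injects the latent plan z."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, z_tokens, causal_mask):
        h = self.ln1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask)[0]
        h = self.ln2(x)
        x = x + self.cross_attn(h, z_tokens, z_tokens)[0]   # attend to the plan
        return x + self.mlp(self.ln3(x))

class LatentPlanTransformer(nn.Module):
    """Plan prior U_alpha, trajectory generator p_beta, return head r_gamma."""
    def __init__(self, state_dim, action_dim, d_model=128, n_heads=4, n_blocks=3, z_dim=16):
        super().__init__()
        self.prior_net = nn.Sequential(nn.Linear(z_dim, 256), nn.GELU(),
                                       nn.Linear(256, z_dim))           # U_alpha (MLP variant)
        self.z_proj = nn.Linear(z_dim, d_model)
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        self.blocks = nn.ModuleList(LPTBlock(d_model, n_heads) for _ in range(n_blocks))
        self.action_mean = nn.Linear(d_model, action_dim)               # mu_beta
        self.return_head = nn.Sequential(nn.Linear(z_dim, 256), nn.GELU(),
                                         nn.Linear(256, 1))             # r_gamma

    def forward(self, states, actions, z):
        # states: (B, K, state_dim); actions: (B, K, action_dim); z: (B, z_dim)
        prev_actions = torch.cat([torch.zeros_like(actions[:, :1]), actions[:, :-1]], dim=1)
        x = self.embed(torch.cat([states, prev_actions], dim=-1))       # token t sees (s_t, a_{t-1})
        K = x.size(1)
        mask = torch.triu(torch.full((K, K), float("-inf"), device=x.device), diagonal=1)
        z_tokens = self.z_proj(z).unsqueeze(1)                          # (B, 1, d_model)
        for blk in self.blocks:
            x = blk(x, z_tokens, mask)
        return self.action_mean(x), self.return_head(z).squeeze(-1)     # mu_beta, r_gamma(z)
```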

Training algorithm:

  • LPT is optimized via offline maximum likelihood, leveraging (approximate) posterior sampling in the latent space.
  • For each training example, $p_\theta(z \mid \tau, y)$ is approximated via Langevin dynamics on $z_0$, with transitions

$$z_0^{k+1} = z_0^k + s\, \nabla_{z_0} \log p(z_0, \tau, y) + \sqrt{2s}\, \epsilon^k$$

where the gradient is taken through the joint log-probability of the trajectory and return.

  • Model parameters $(\alpha, \beta, \gamma)$ are updated by gradient ascent using empirical averages over the sampled $z_0$ (Kong et al., 7 Feb 2024); a sketch of this training loop follows.
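
A minimal sketch of this loop, reusing the hypothetical `joint_log_prob` from Section 2 and assuming short-run Langevin chains initialized from the prior; step counts and step sizes are placeholders, not the paper's settings.

```python
import torch

def langevin_posterior_sample(z0, traj, y, model, n_steps=15, step_size=0.1):
    """Short-run Langevin dynamics approximating p_theta(z0 | tau, y):
    z0 <- z0 + s * grad_z0 log p(z0, tau, y) + sqrt(2 s) * noise."""
    z0 = z0.detach().clone()
    for _ in range(n_steps):
        z0.requires_grad_(True)
        logp = joint_log_prob(z0, traj, y, model).sum()    # joint log-density sketch above
        grad = torch.autograd.grad(logp, z0)[0]
        with torch.no_grad():
            z0 = z0 + step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(z0)
    return z0.detach()

def training_step(traj, y, model, optimizer, z_dim=16):
    """One maximum-likelihood update: draw z0 from the approximate posterior,
    then ascend the joint log-likelihood in the parameters (alpha, beta, gamma)."""
    z0_init = torch.randn(y.shape[0], z_dim)               # or a persistent chain per example
    z0 = langevin_posterior_sample(z0_init, traj, y, model)
    loss = -joint_log_prob(z0, traj, y, model).mean()      # empirical average over sampled z0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```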

4. Planning as Latent Space Inference

At test time, LPT realizes planning as inference by conditioning on a desired return $y^*$. The plan $z^*$ is inferred as the mode of $p_\theta(z \mid y^*)$ via MCMC in the latent space:
$$z_0^{k+1} = z_0^k + s\, \nabla_{z_0}\left[\log p_0(z_0) + \log p_\gamma(y^* \mid U_\alpha(z_0))\right] + \sqrt{2s}\, \epsilon$$
After $N$ steps, set $z^* = U_\alpha(z_0^N)$. An episode is then generated by rolling out the autoregressive policy:
$$a_t \sim p_\beta(a_t \mid s_{t-K:t-1}, a_{t-K:t-1}, z^*), \qquad s_{t+1} \sim \mathrm{Env}(s_t, a_t)$$

This approach enables specification of arbitrary target returns, framing planning as finding a plan $z^*$ most compatible with the desired outcome in the learned latent space (Kong et al., 7 Feb 2024).
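
A sketch of this test-time inference, under the same assumptions about the model interface as above (hypothetical `prior_net` and `return_head` attributes); the target return, step size, and number of steps are placeholders.

```python
import torch

def infer_plan(y_star, model, z_dim=16, n_steps=100, step_size=0.05, sigma2=1.0):
    """Planning as inference: Langevin dynamics on z0 targeting the desired
    return y*, ascending log p0(z0) + log p_gamma(y* | U_alpha(z0))."""
    z0 = torch.randn(1, z_dim)
    for _ in range(n_steps):
        z0.requires_grad_(True)
        z = model.prior_net(z0)
        y_hat = model.return_head(z).squeeze(-1)
        logp = (-0.5 * (z0 ** 2).sum(dim=-1) - 0.5 * (y_star - y_hat) ** 2 / sigma2).sum()
        grad = torch.autograd.grad(logp, z0)[0]
        with torch.no_grad():
            z0 = z0 + step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(z0)
    with torch.no_grad():
        return model.prior_net(z0)   # z* = U_alpha(z0^N)

# Example usage (hypothetical numbers): infer a plan for a target return, then
# condition the autoregressive policy on it at every step of the rollout:
#   z_star = infer_plan(y_star=3600.0, model=model)
#   a_t ~ p_beta(a_t | s_{t-K:t-1}, a_{t-K:t-1}, z_star)
```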

5. Empirical Evaluation and Results

LPT is benchmarked on a range of environments:

| Domain | Tasks/Subsets | Characteristics |
| --- | --- | --- |
| Gym-Mujoco | HalfCheetah, Hopper, Walker2D (medium, replay), AntMaze (umaze, diverse) | Continuous control, dense/sparse reward |
| Maze2D | umaze, medium, large | Navigation, sparse reward |
| Connect Four | vs. stochastic opponent | Board game, adversarial |

Baselines: CQL, Decision Transformer (DT), Q-learning Decision Transformer (QDT), Online DT (ODT), ESPER.

Metrics: Average return $\pm$ standard deviation over 5 seeds.

Key statistical findings:

  • On Gym-Mujoco with only final return supervision, LPT outperforms DT and QDT, at times matching or exceeding CQL, which has access to step-wise rewards.
  • On Maze2D and AntMaze, LPT yields a $2\times$ to $5\times$ improvement over DT by stitching suboptimal trajectory fragments into near-optimal full trajectories.
  • On Connect Four, LPT achieves performance ($0.99 \pm 0.01$) matching SOTA ESPER ($0.99 \pm 0.03$), substantially outperforming DT ($0.8 \pm 0.07$) (Kong et al., 7 Feb 2024).

Qualitative insights:

  • Posterior gradients in $z$ integrate reward information from sub-trajectories, automating credit assignment.
  • t-SNE visualizations indicate the latent plan space allows interpolation between trajectories, capturing novel, high-return behaviors via trajectory stitching.
  • During execution, the latent $z$ is fixed, but the policy adapts to environment stochasticity, limiting overfitting to dataset contingencies.

6. Strengths, Limitations, and Future Prospects

Strengths:

  • Enforces temporal consistency over entire episodes without explicit step-wise reward or return-to-go conditioning.
  • Posterior sampling in latent space creates abstractions aggregating information across finite-context fragments.
  • Planning as inference (MCMC on $z$) allows for return-conditioned generation without relying on reward-to-go as input.
  • Demonstrates competitive or superior empirical results across dense, sparse, and adversarial benchmarks, supporting long-horizon credit assignment and trajectory composition (Kong et al., 7 Feb 2024).

Limitations and open questions:

  • MCMC-based latent sampling scales poorly for very long horizons. While persistent chains and fewer steps partially mitigate this, further advances such as amortized inference are desirable.
  • The implicit latent prior $p_\alpha(z)$ lacks explicit density modeling; replacing or augmenting it with normalizing flows or energy-based models may improve expressiveness.
  • Extending LPT to multi-task or hierarchical RL with discrete/discontinuous returns is an open direction.
  • Online continual fine-tuning currently yields limited gains; integrating the LPT posterior machinery with value-based pessimistic objectives remains an open research problem (Kong et al., 7 Feb 2024).