
Latent Plan Transformer (LPT)

Updated 13 April 2026
  • The paper presents LPT as a generative framework that uses a single latent plan variable to abstract complete trajectories and align them with final returns.
  • It employs variational inference with persistent Langevin sampling to handle credit assignment and trajectory stitching in offline RL settings.
  • Empirical evaluations show LPT achieves competitive or superior performance on benchmarks by effectively generating action sequences conditioned on desired returns.

The Latent Plan Transformer (LPT) is a generative modeling framework for planning in offline reinforcement learning (RL) environments where only final trajectory returns, rather than step-wise rewards, are available. LPT employs a single latent plan variable to abstract and mediate the relationship between trajectories and overall returns. The model is trained on static trajectory-return datasets, leveraging approximate maximum likelihood estimation using variational inference via persistent Langevin sampling. At test time, LPT supports planning as inference: it generates latent plans that condition on desired final returns and autoregressively produces action sequences. Empirical evaluations demonstrate that LPT achieves competitive or superior performance compared to Decision Transformer (DT) and other baseline methods on a range of offline RL benchmarks, particularly excelling in temporal credit assignment, trajectory stitching, and adaptation to environmental contingencies (Kong et al., 2024).

1. Problem Formulation and Motivations

LPT addresses the challenge of planning from offline datasets composed solely of full trajectories and their associated final returns, rather than settings where per-step rewards are observed or provided. The data consists of sequences $\tau = (s_1,a_1,\ldots,s_H,a_H)$ with only the scalar final return $R = \sum_{t=1}^H r(s_t,a_t)$ observed. This setup presents unique challenges:

  • Temporal Consistency: Models must maintain credit assignment over long horizons without step-wise rewards or return-to-go signals (unlike Decision Transformer).
  • Credit Assignment: There is no explicit signal for aligning early steps with overall return, requiring the model to internally “discover” credit allocation.
  • Trajectory Stitching: Ability to compose sub-trajectories from varying sources to improve on sub-optimal demonstration paths.

LPT is constructed to explicitly tackle these issues by introducing a trajectory-level latent variable orchestrating the generation and evaluation of behavior.
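To make the data format from the start of this section concrete, here is a minimal sketch of a return-labeled trajectory. The class and function names are hypothetical, and states and actions are scalars purely for brevity. It also shows the per-step return-to-go signal that a Decision Transformer style model would need and that is unavailable in this setting.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReturnLabeledTrajectory:
    """One dataset entry: the full trajectory plus a single scalar return."""
    steps: List[Tuple[float, float]]  # (state, action) pairs; scalars for brevity
    final_return: float               # R = sum of per-step rewards; the only label kept


def make_example(steps, rewards):
    # Per-step rewards are summed into R and then discarded.
    return ReturnLabeledTrajectory(steps=list(steps), final_return=sum(rewards))


def returns_to_go(rewards):
    # The step-wise signal that return-to-go conditioning relies on.
    # It cannot be computed when only final_return is stored.
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]
```

In this sketch, `make_example` is all the learner ever sees; `returns_to_go` is included only to show what is missing.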

2. Model Architecture

The LPT architecture posits a single latent plan $z$ which functions as:

  • An abstraction of the entire trajectory.
  • The predictor of associated final return.
  • The directive for stepwise autoregressive action generation.

The generative process is formalized as:

$$p_\theta(\tau, R, z) = p_\alpha(z)\, p_\beta(\tau \mid z)\, p_\gamma(R \mid z)$$

Latent Plan Prior and Transformation

  • The prior $p_\alpha(z)$ is implemented via a learnable transformation $U_\alpha$ applied to $z_0 \sim \mathcal{N}(0, I_d)$, so that $z = U_\alpha(z_0)$.
  • $U_\alpha$ is typically a small UNet; the resulting distribution over $z$ is implicit.
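As a rough illustration of the transformed prior, the sketch below draws Gaussian noise and pushes it through a fixed nonlinearity. The `transform_prior` function is a hypothetical stand-in for $U_\alpha$ (the source describes a small UNet, which is not reproduced here); the point is only that sampling is easy even when the distribution is defined implicitly through a transformation.

```python
import math
import random


def transform_prior(z0, scale=2.0):
    # Hypothetical stand-in for U_alpha: any learnable map from noise to plans.
    return [scale * math.tanh(v) for v in z0]


def sample_latent_plan(dim, rng):
    # z0 ~ N(0, I_d), then z = U_alpha(z0).
    z0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return transform_prior(z0)
```

For example, `sample_latent_plan(4, random.Random(0))` returns a list of four floats; the `scale` argument is an assumption of this sketch, not a parameter from the source.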

Trajectory Generator (Causal Transformer)

  • $p_\beta(\tau \mid z) = \prod_{t=1}^H p_\beta(a_t \mid s_{t-K:t}, a_{t-K:t-1}, z)\, p(s_{t+1} \mid s_t, a_t)$, where $K$ is the context length.
  • The policy is Gaussian, with its mean given by a function $g_\beta$ realized as a causal Transformer.
  • Inputs: At each timestep, state-action pairs $(s_t, a_t)$ are embedded and concatenated with learned positional encodings.
  • Cross-attention layers in each Transformer block broadcast $z$ into each step's context.
  • Autoregressive next-action prediction heads are applied to token representations.
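The factorization above can be sketched as a sum of per-step Gaussian log-densities over a sliding context window. This is a toy version under stated assumptions: states, actions, and $z$ are scalars, `toy_policy_mean` is a hypothetical stand-in for the Transformer $g_\beta$, and the environment term $p(s_{t+1} \mid s_t, a_t)$ is omitted because it does not depend on $z$ or $\beta$.

```python
import math


def gaussian_log_pdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2


def context_window(states, actions, t, K):
    # 0-indexed step t: states s_{t-K..t} and actions a_{t-K..t-1}.
    lo = max(0, t - K)
    return states[lo:t + 1], actions[lo:t]


def toy_policy_mean(state_ctx, action_ctx, z):
    # Hypothetical stand-in for g_beta; a real model would be a causal Transformer.
    return z * state_ctx[-1] + 0.1 * sum(action_ctx)


def action_log_likelihood(states, actions, z, K, std=1.0, mean_fn=toy_policy_mean):
    # Sum over t of log N(a_t; g(context_t, z), std^2).
    total = 0.0
    for t in range(len(actions)):
        s_ctx, a_ctx = context_window(states, actions, t, K)
        total += gaussian_log_pdf(actions[t], mean_fn(s_ctx, a_ctx, z), std)
    return total
```

Note that every term in the sum sees the same $z$, which is what lets a single latent plan tie together windows that never overlap.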

Return Predictor

  • $p_\gamma(R \mid z)$ regresses $R$ from $z$ using a small MLP, modeled as a Gaussian output.
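A minimal sketch of such a return head follows, assuming a one-hidden-layer MLP with tanh activations and a fixed output standard deviation; the layer sizes, activation, and parameter layout are illustrative choices, not details from the source. With a fixed standard deviation, the Gaussian negative log-likelihood reduces to a scaled squared error plus a constant.

```python
import math


def mlp_return_mean(z, w1, b1, w2, b2):
    # One hidden layer with tanh; w1 is a list of rows, one per hidden unit.
    hidden = [math.tanh(sum(w * v for w, v in zip(row, z)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def return_nll(R, z, params, std=1.0):
    # Negative log-density of R under N(mean(z), std^2).
    mean = mlp_return_mean(z, *params)
    return 0.5 * math.log(2 * math.pi * std ** 2) + 0.5 * ((R - mean) / std) ** 2
```

Here `params` is the tuple `(w1, b1, w2, b2)`.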

Typical architectural choices include the latent dimension $d$, the Transformer depth (number of layers), an embedding size of 128–192, and 1–8 attention heads.

3. Learning and Inference Methodology

LPT is trained via approximate maximum likelihood using a variational lower bound (ELBO):

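$$\log p_\theta(\tau, R) \;\geq\; \mathbb{E}_{q(z \mid \tau, R)}\big[\log p_\beta(\tau \mid z) + \log p_\gamma(R \mid z)\big] - D_{\mathrm{KL}}\big(q(z \mid \tau, R)\,\|\,p_\alpha(z)\big)$$

  • The variational posterior $q(z \mid \tau, R)$ is set to $p_\theta(z \mid \tau, R)$ and approximated by MCMC (Langevin dynamics) samples.
  • Posterior sampling is performed by running a fixed number of Langevin steps for each training pair $(\tau, R)$, using gradients of the log-joint.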
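For reference, a standard form of the Langevin update is written out below. The step size $s$ and the iteration index $k$ are notation introduced here for illustration, not symbols from the source. The gradient of the log-posterior equals the gradient of the log-joint because $\log p_\theta(\tau, R)$ does not depend on $z$.

```latex
\begin{aligned}
z^{(k+1)} &= z^{(k)} + s\,\nabla_z \log p_\theta\big(\tau, R, z^{(k)}\big) + \sqrt{2s}\,\epsilon_k,
\qquad \epsilon_k \sim \mathcal{N}(0, I_d), \\
\nabla_z \log p_\theta(z \mid \tau, R) &= \nabla_z \log p_\theta(\tau, R, z)
= \nabla_z \log p_\alpha(z) + \nabla_z \log p_\beta(\tau \mid z) + \nabla_z \log p_\gamma(R \mid z).
\end{aligned}
```

Since the prior over $z$ is implicit, a practical variant may run the same update on $z_0$ with $z = U_\alpha(z_0)$, where the Gaussian density of $z_0$ is available in closed form.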

Parameter updates for $\theta = (\alpha, \beta, \gamma)$ are computed using gradients through the sampled latent plans and aggregated across minibatches; the learning rate, weight decay, and dropout are training hyperparameters.

Persistent MCMC chains per data point enable amortized posterior approximation across epochs.
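The idea of persistent chains can be sketched on a toy model where the exact posterior is known. The assumptions are loud here: $z$ is a scalar with prior $\mathcal{N}(0, 1)$, the return likelihood is $\mathcal{N}(R; z, \sigma^2)$, and the trajectory term is dropped entirely. The class name and defaults are hypothetical. Each data index keeps its last sample, so the next epoch resumes from it instead of restarting from noise.

```python
import math
import random


def grad_log_joint(z, R, noise_std):
    # d/dz [ log N(z; 0, 1) + log N(R; z, noise_std^2) ]
    return -z + (R - z) / noise_std ** 2


def langevin_steps(z, R, n_steps, step_size, noise_std, rng):
    for _ in range(n_steps):
        z = (z + step_size * grad_log_joint(z, R, noise_std)
             + math.sqrt(2 * step_size) * rng.gauss(0.0, 1.0))
    return z


class PersistentChains:
    """Stores one latent sample per data index and warm-starts from it."""

    def __init__(self, rng):
        self.rng = rng
        self.state = {}

    def sample(self, index, R, n_steps=20, step_size=0.01, noise_std=0.5):
        z = self.state.get(index)
        if z is None:
            z = self.rng.gauss(0.0, 1.0)  # cold start only on first visit
        z = langevin_steps(z, R, n_steps, step_size, noise_std, self.rng)
        self.state[index] = z             # persist for the next epoch
        return z
```

In this toy model the exact posterior mean is $R / (1 + \sigma^2)$, so repeated calls to `sample` for the same index should scatter around that value once the chain has warmed up.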

4. Test-Time Planning as Latent Space Inference

At deployment, LPT realizes “planning as inference.” For a desired final return $R$:

  1. Sample $z_0$ from the posterior conditioned on $R$,

$$p_\theta(z_0 \mid R) \propto \mathcal{N}(z_0; 0, I_d)\; p_\gamma\big(R \mid z = U_\alpha(z_0)\big)$$

using Langevin dynamics, bypassing the $\tau$-likelihood term for efficiency.

  2. Decode $z = U_\alpha(z_0)$.
  3. Roll out an episode by autoregressively sampling $a_t \sim p_\beta(a_t \mid s_{t-K:t}, a_{t-K:t-1}, z)$ at each timestep, feeding actions to the environment.

Empirical evidence indicates that trajectories generated in this way concentrate around desired returns.
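The planning procedure in this section can be sketched end to end on a toy problem. Everything below is an assumption made for illustration: the latent is a scalar, `decode_plan` is an identity stand-in for the learned prior transformation, the return head predicts `HORIZON * z`, the policy always emits `z` as its action, and the environment pays a reward equal to the action.

```python
import math
import random

HORIZON = 3
RETURN_STD = 0.1


def decode_plan(z0):
    # Hypothetical stand-in for the learned prior transformation (identity here).
    return z0


def predicted_return(z):
    # Hypothetical return head: mean of p(R | z).
    return HORIZON * z


def grad_log_posterior(z0, target_return):
    # d/dz0 [ log N(z0; 0, 1) + log N(target; predicted_return(z), RETURN_STD^2) ].
    # decode_plan is the identity, so the chain-rule factor through it is 1.
    residual = target_return - predicted_return(decode_plan(z0))
    return -z0 + HORIZON * residual / RETURN_STD ** 2


def infer_plan(target_return, rng, n_steps=200, step_size=5e-4):
    # Langevin dynamics on z0 using only the prior and the return head.
    z0 = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        z0 += (step_size * grad_log_posterior(z0, target_return)
               + math.sqrt(2 * step_size) * rng.gauss(0.0, 1.0))
    return decode_plan(z0)


def rollout(z):
    # Toy policy and environment: action = z, reward = action.
    total = 0.0
    for _ in range(HORIZON):
        action = z
        total += action
    return total
```

Calling `rollout(infer_plan(3.0, random.Random(0)))` should give a return near the requested 3.0, up to sampling noise. The plan is fixed before the rollout starts and is not updated from step-wise feedback.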

5. Empirical Evaluation and Comparative Results

LPT was evaluated on several offline RL domains with rewards fully hidden except for final returns:

Gym-MuJoCo (D4RL) Benchmarks

| Task | DT | QDT | LPT (Final-Return) |
|---|---|---|---|
| halfcheetah-medium | 42.4 ± 0.5 | 42.4 | 43.1 ± 0.4 |
| walker2d-medium-replay | 51.6 ± 24.6 | 29.6 | 72.3 ± 1.9 |

Maze2D (Delayed Reward)

| Task | DT | QDT | LPT |
|---|---|---|---|
| umaze | 31.0 ± 21.3 | 57.3 ± 8.2 | 57.4 ± 2.9 |
| medium | 8.2 ± 4.4 | 13.3 ± 5.6 | 20.6 ± 1.8 |
| large | 2.3 ± 0.9 | 31.0 ± 19.8 | 22.6 ± 1.9 |

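On the medium maze, LPT outperforms both DT and QDT (20.6 vs. 8.2 and 13.3), and on umaze it matches QDT with lower variance. On the large maze its mean trails QDT (22.6 vs. 31.0), though with a much smaller spread. These results are consistent with strong long-range credit assignment and a capacity for trajectory stitching.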

Connect Four

Average return (win = 1, draw = 0, lose = –1):

  • DT: (value missing in source)
  • LPT: (value missing in source)

LPT nearly solves the environment and exhibits robustness to adversarial, stochastic opponents.

6. Insights: Credit Assignment, Trajectory Stitching, and Contingency Adaptation

  • Credit Assignment: Gradients with respect to $z$ integrate information from all finite-context subtrajectories, enforcing trajectory-wide abstraction.
  • Trajectory Stitching: LPT can synthesize improved trajectories by combining segments from otherwise sub-optimal demonstrations, as observed in Maze2D visualizations.
  • Contingency Adaptation: By fixing $z$ prior to rollout and eschewing stepwise signals, LPT demonstrates lower susceptibility to overfitting on stochastic environmental artifacts compared to DT.

7. Significance and Implications

By reframing trajectory generation as latent-variable inference, LPT circumvents the need for explicit reward sequences and supports robust temporally consistent plan execution from offline data. Empirical results affirm its efficacy as a general framework for planning under severely delayed, aggregate return feedback. This suggests latent variable approaches provide a viable alternative to step-wise reward prompting in sequence modeling for offline RL (Kong et al., 2024).

References (1)
