PaD: Energy-Based Trajectory Planning

Updated 26 December 2025
  • The paper introduces an energy-based framework that refines future trajectories via gradient descent, ensuring goal-consistent and dynamically plausible plans.
  • It employs a unified training and inference approach using hindsight goal relabeling and a denoising loss to mitigate train-test mismatches typical in RL.
  • Experimental results on OGBench cube manipulation tasks show 95–98% success and improved efficiency compared to classical planners and policy-based methods.

Planning as Descent (PaD) is an offline goal-conditioned trajectory synthesis framework grounded in energy-based modeling of entire future trajectories. Rather than explicitly learning a policy or a forward model for planning, PaD formulates planning as direct gradient-based refinement in a learned latent energy landscape, where the energy function assigns low values to feasible, goal-consistent trajectories. Training and inference share identical computation through a denoising-style objective aligned with the planning procedure, mitigating the train-test mismatch common in model-based and policy-based reinforcement learning. PaD empirically demonstrates superior performance on OGBench cube manipulation tasks, in regimes with either narrow expert demonstrations or broad, noisy demonstrations, and provides a principled alternative to both policy learning and classical search-based planners (García et al., 19 Dec 2025).

1. Mathematical Formulation of the Energy-Based Trajectory Objective

PaD operates over trajectories encoded into a latent space by an encoder $f_\theta: \mathcal{S} \rightarrow \mathbb{R}^d$, which maps each state $s_t$ to a latent $z_t$. The planning context consists of:

  • Past window: $z_\mathrm{past} = (z_0, \dots, z_k)$
  • Future latent trajectory: $z_\mathrm{future} = (z_{k+1}, \dots, z_{k+H})$
  • Goal state: encoded as $f_\theta(s_g)$
  • Normalized time-to-reach: $\lambda \in [0,1]$, specifying the desired goal timing within the horizon

PaD defines a scalar energy function

$E_\theta\left(z_\mathrm{future} \mid z_\mathrm{past}, s_g, \lambda\right) \in \mathbb{R}.$

Minimizing the energy with respect to $z_\mathrm{future}$ synthesizes a trajectory that is both dynamically plausible (conditioned on the past) and goal-consistent (reaching the desired goal around fraction $\lambda$ of the horizon).
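
To make this interface concrete, the following is a minimal PyTorch sketch of such a conditional energy function. The MLP over concatenated latents, the dimensions, and the class name TrajectoryEnergy are illustrative assumptions rather than the architecture described in the paper; any network mapping $(z_\mathrm{future}, z_\mathrm{past}, f_\theta(s_g), \lambda)$ to a scalar fits the formulation.

```python
import torch
import torch.nn as nn

class TrajectoryEnergy(nn.Module):
    """Sketch: scalar energy over a candidate future latent trajectory,
    conditioned on the past window, encoded goal, and time-to-reach lambda."""

    def __init__(self, d: int, horizon: int, past_len: int, hidden: int = 512):
        super().__init__()
        # Inputs: flattened future latents + flattened past latents + goal latent + lambda.
        in_dim = d * horizon + d * past_len + d + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),      # scalar energy per trajectory
        )

    def forward(self, z_future, z_past, z_goal, lam):
        # z_future: (B, H, d), z_past: (B, past_len, d), z_goal: (B, d), lam: (B, 1)
        x = torch.cat([z_future.flatten(1), z_past.flatten(1), z_goal, lam], dim=-1)
        return self.net(x).squeeze(-1)  # (B,): lower energy = more plausible, goal-consistent
```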

2. Training with Hindsight Goal Relabeling and Denoising Loss

PaD utilizes a self-supervised denoising-style objective paralleling inference-time refinement to sculpt the energy landscape. The central training operations include:

  • Hindsight relabeling with temporal targets: A cut time $P$ is chosen within a demonstration trajectory of length $L$, and the first $P$ states define the past. A value $\lambda \sim \mathcal{U}(0,1)$ is sampled, and the index $G = \lfloor \lambda (L-P) \rfloor + P$ designates the goal $s_g = s_G$.
  • Latent corruption: Clean future latents $z_\mathrm{clean}$ are computed, then perturbed by

$z_\mathrm{future}^{(0)} = \sqrt{\beta}\, z_\mathrm{clean} + \sqrt{1-\beta}\, \epsilon$

with $\beta \sim \mathcal{U}(0,1)$ and $\epsilon \sim \mathcal{N}(0, I)$.

  • Iterative refinement: The corrupted latents are refined for $T$ steps via

$z_\mathrm{future}^{(t+1)} = p_\theta\!\left( z_\mathrm{future}^{(t)} - \eta \nabla_{z_\mathrm{future}} E_\theta\!\left(z_\mathrm{future}^{(t)} \mid z_\mathrm{past}, s_g, \lambda\right) \right)$

where $p_\theta$ projects back onto the encoder manifold and $\eta$ is a learnable step size.

  • Denoising loss:

$\mathcal{L} = \sum_{t=1}^{T} \mathrm{smooth\mbox{-}L_1} (z_\mathrm{future}^{(t)}, z_\mathrm{clean})$

This loss, with stop-gradient through $z_\mathrm{future}^{(t)}$ between steps, aligns training with iterative refinement and eliminates the need for explicit negatives or partition-function estimates.
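
A minimal sketch of one training step under these operations, assuming black-box modules f_theta (encoder), E_theta (energy), p_theta (projection), a learnable log step size log_eta, and illustrative hyperparameters H and T; the function and argument names are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pad_training_step(f_theta, E_theta, p_theta, log_eta, optimizer,
                      traj_states, H=32, T=10):
    """One PaD-style training step on a single demonstration trajectory (sketch)."""
    L = traj_states.shape[1]                              # demonstration length
    P = torch.randint(1, L - H, (1,)).item()              # cut time: first P states are the past
    lam = torch.rand(1)                                   # lambda ~ U(0, 1)
    G = int(lam.item() * (L - P)) + P                     # hindsight goal index

    z_all = f_theta(traj_states)                          # (1, L, d) encoded states
    z_past = z_all[:, :P]                                 # past window
    z_clean = z_all[:, P:P + H].detach()                  # clean future latents (targets)
    z_goal = z_all[:, G]                                  # encoded hindsight goal, (1, d)

    beta = torch.rand(1)
    eps = torch.randn_like(z_clean)
    z = beta.sqrt() * z_clean + (1.0 - beta).sqrt() * eps  # corrupted initialization

    loss = torch.zeros(())
    for _ in range(T):
        z = z.detach().requires_grad_(True)               # stop-gradient between refinement steps
        energy = E_theta(z, z_past, z_goal, lam.view(1, 1)).sum()
        grad, = torch.autograd.grad(energy, z, create_graph=True)
        z = p_theta(z - log_eta.exp() * grad)             # descend with learnable step size, then project
        loss = loss + F.smooth_l1_loss(z, z_clean)        # denoising loss against clean latents

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The create_graph=True call lets the denoising loss backpropagate through the energy gradient at each refinement step, while the per-step detach implements the stop-gradient between steps described above.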

3. Inference and Planning via Gradient-Based Latent Trajectory Descent

At test time, given observed past states and a goal:

  • Batch initialization: Encode the past window $z_\mathrm{past}$, then sample $B$ candidate futures $z_\mathrm{future}^{(0,b)} \sim \mathcal{N}(0, I)$ and time-to-reach values $\lambda_b \sim \mathcal{U}(0, 1)$.
  • Gradient-based refinement:

$z_\mathrm{future}^{(t+1,b)} = p_\theta\!\left( z_\mathrm{future}^{(t,b)} - \eta \nabla_{z_\mathrm{future}} E_\theta\!\left(z_\mathrm{future}^{(t,b)} \mid z_\mathrm{past}, s_g, \lambda_b\right) \right)$

  • Scoring and selection: Final energies $E_b$ are computed, and the top-$K$ candidates with the lowest energy are retained. Sampling among these is biased toward lower $\lambda$ (more efficient plans).
  • Action execution: The chosen plan $z_\mathrm{plan}$ is passed through an inverse dynamics model to generate executable actions; after $N$ steps, the process replans from the current state. A sketch of this loop appears below.
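
A minimal sketch of this batched planning loop, assuming the same hypothetical modules as in the training sketch plus an inverse dynamics model inv_dyn; the batch size B, refinement depth T, top-K size, and the softmax weighting used to bias selection toward lower $\lambda$ are illustrative choices rather than values from the paper.

```python
import torch

def pad_plan(f_theta, E_theta, p_theta, eta, inv_dyn, past_states, goal_state,
             B=64, H=32, T=10, K=8, d=64):
    """Batched gradient-based refinement of candidate future latent trajectories (sketch)."""
    z_past = f_theta(past_states).unsqueeze(0).expand(B, -1, -1)   # (B, k+1, d)
    z_goal = f_theta(goal_state).unsqueeze(0).expand(B, -1)        # (B, d)
    lam = torch.rand(B, 1)                                         # candidate time-to-reach values
    z = torch.randn(B, H, d)                                       # candidate futures ~ N(0, I)

    for _ in range(T):                                             # gradient-based refinement
        z = z.detach().requires_grad_(True)
        energy = E_theta(z, z_past, z_goal, lam).sum()
        grad, = torch.autograd.grad(energy, z)
        z = p_theta(z - eta * grad)

    with torch.no_grad():
        final_energy = E_theta(z, z_past, z_goal, lam)             # (B,)
        top = final_energy.topk(K, largest=False).indices          # K lowest-energy candidates
        weights = torch.softmax(-lam[top].squeeze(-1), dim=0)      # bias selection toward lower lambda
        b = top[torch.multinomial(weights, 1)]                     # sampled plan index, shape (1,)
        z_plan = z[b]                                              # (1, H, d) chosen latent plan
        # Decode the first step with an inverse dynamics model; the caller
        # executes N actions and then replans from the new current state.
        first_action = inv_dyn(z_past[b, -1], z_plan[:, 0])
    return z_plan, first_action
```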

4. Experimental Evaluation on OGBench Cube Manipulation

PaD was evaluated on the OGBench “cube-single” manipulation tasks, considering two data regimes:

| Regime | Data Source | PaD Success | Baseline (GCIQL) Success | PaD Efficiency (steps) |
|---|---|---|---|---|
| Expert | cube-single-play-v0 (expert) | 95 ± 2% | 68 ± 6% | 78 ± 7 |
| Noisy | cube-single-noisy-v0 (noisy) | 98 ± 2% | 99 ± 1% | 63 ± 6 |

Efficiency is measured as the mean number of steps to solve in successful rollouts. Training on noisy, suboptimal data produced shorter test-time solutions than training on expert data, suggesting that broader coverage allows refinement toward more direct plans than found in narrow expert demonstrations (García et al., 19 Dec 2025).

5. Comparative Analysis: Energy-Based Planning versus Policy Learning and Model Predictive Control

PaD’s energy-based approach mitigates the train–test mismatch typical of model-based and policy-based RL:

  • Jointly learned verifier and planner: The energy gradients used for test-time planning are trained for denoising, ensuring that optimization behavior is supported by the energy landscape. There is no regime in which adversarial plans succeed under the model but fail in reality, as the planner is never trained to accept such solutions.
  • No explicit model rollouts: By eschewing forward simulation with an imperfect dynamics model, PaD avoids the compounding of model errors and adversarial trajectory exploitation.
  • Explicit goal conditioning and temporal bias: Hindsight relabeling with a “time-to-reach” variable focuses supervision on both goal attainment and plan efficiency, which is typically unavailable to generative or diffusion-based approaches.

Sample-based generative models (such as diffusion models or transformers) reproduce empirical trajectory distributions but lack explicit feasibility verification and often overfit predominant but suboptimal behavioral patterns in their training data. In contrast, PaD’s energy function is trained to verify and refine candidate solutions directly through descent, centering planning on the evaluation of goal-conditioned feasibility (García et al., 19 Dec 2025).

6. Connections to Other Planning-as-Descent Paradigms

Related approaches, such as Stein Task and Motion Planning (STAMP), generalize the “planning as descent” principle to hybrid discrete–continuous robotics problems using Stein Variational Gradient Descent in continuous relaxations of symbolic plans. While PaD operates by learning a latent energy landscape over full trajectories and using denoising-style objectives, STAMP employs gradient-based variational inference over both symbolic and continuous planning parameters, leveraging differentiable physics for plan refinement (Lee et al., 2023). Both frameworks unify planning and verification under gradient flow, but PaD’s core innovation is the joint learning and utilization of a goal-conditioned latent trajectory energy for direct, model-free verification-driven synthesis.

7. Implications and Position in the Planning Landscape

PaD occupies an intermediate conceptual role between direct policy learning and explicit search-based planning. It avoids the direct state-action mapping (with associated generalization errors) of policy learning, and circumvents the modeling and compounding error pitfalls of classical planners dependent on imperfect simulators. The verification-driven refinement process, regularized by denoising and hindsight, yields a robust, train–test aligned paradigm for offline, reward-free planning in complex, high-dimensional domains. This suggests a broader applicability of energy-based objectives and descent-based synthesis for controllable trajectory generation across reinforcement learning and robotics (García et al., 19 Dec 2025).
