PaD: Energy-Based Trajectory Planning
- The paper introduces an energy-based framework that refines future trajectories via gradient descent, ensuring goal-consistent and dynamically plausible plans.
- It employs a unified training and inference approach using hindsight goal relabeling and a denoising loss to mitigate train-test mismatches typical in RL.
- Experimental results on OGBench cube manipulation tasks show 95–98% success and improved efficiency compared to classical planners and policy-based methods.
Planning as Descent (PaD) is an offline goal-conditioned trajectory synthesis framework grounded in energy-based modeling of entire future trajectories. Rather than explicitly learning a policy or a forward model for planning, PaD formulates the planning problem as direct gradient-based refinement in a learned latent energy landscape, where the energy function assigns low values to feasible, goal-consistent trajectories. Training and inference share identical computation by leveraging a denoising-style objective aligned with the planning procedure, thus mitigating the common train-test mismatch in model-based and policy-based reinforcement learning approaches. PaD demonstrates strong empirical performance on OGBench cube manipulation tasks under both narrow expert and broad noisy demonstration regimes, and provides a principled alternative to both policy learning and classical search-based planners (García et al., 19 Dec 2025).
1. Mathematical Formulation of the Energy-Based Trajectory Objective
PaD operates over trajectories encoded into a latent space using an encoder $\mathcal{E}$, where each state $s_t$ is mapped to a latent $z_t = \mathcal{E}(s_t)$. The planning context consists of:
- Past window: the encoded recent states $z_{\mathrm{past}}$, conditioning the plan on observed history
- Future latent trajectory: $z_{\mathrm{future}}$, the latent segment over the planning horizon $H$ that is optimized
- Goal state: encoded as $z_g = \mathcal{E}(s_g)$
- Normalized time-to-reach: a scalar $\tau \in [0, 1]$ specifying the desired goal timing within the horizon
PaD defines a scalar energy function $E_\theta(z_{\mathrm{future}};\, z_{\mathrm{past}}, z_g, \tau)$. Minimizing the energy with respect to $z_{\mathrm{future}}$ synthesizes a trajectory that is both dynamically plausible (conditioned on the past) and goal-consistent (reaching the desired goal around fraction $\tau$ of the horizon).
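As a concrete illustration, the sketch below shows one way such a conditional energy could be parameterized in PyTorch. The MLP-over-concatenated-latents architecture, the `TrajectoryEnergy` name, and all dimensions are assumptions for exposition, not the network design reported in the paper.

```python
import torch
import torch.nn as nn

class TrajectoryEnergy(nn.Module):
    """Illustrative scalar energy E_theta(z_future; z_past, z_g, tau).

    The MLP-over-concatenation architecture and the dimensions are
    placeholders for exposition, not the paper's reported architecture.
    """
    def __init__(self, latent_dim: int, past_len: int, future_len: int, hidden: int = 256):
        super().__init__()
        # past + future + goal latents, plus the scalar time-to-reach tau
        in_dim = latent_dim * (past_len + future_len + 1) + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_future, z_past, z_goal, tau):
        # z_future: (B, future_len, d), z_past: (B, past_len, d)
        # z_goal: (B, d), tau: (B, 1) normalized time-to-reach
        x = torch.cat([z_past.flatten(1), z_future.flatten(1), z_goal, tau], dim=-1)
        return self.net(x).squeeze(-1)  # (B,) scalar energies
```

Any architecture that maps the conditioning context and candidate future latents to a scalar would serve the same role; the key property is that its gradient with respect to $z_{\mathrm{future}}$ is well defined.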
2. Training with Hindsight Goal Relabeling and Denoising Loss
PaD utilizes a self-supervised denoising-style objective paralleling inference-time refinement to sculpt the energy landscape. The central training operations include:
- Hindsight relabeling with temporal targets: A cut time is chosen in a demonstration trajectory, and the states up to the cut define the past. A time-to-reach $\tau$ is sampled, and the future state at fraction $\tau$ of the horizon is designated as the goal $s_g$.
- Latent corruption: Clean future latents $z_{\mathrm{clean}}$ are computed, then perturbed to obtain the initialization
$z_{\mathrm{future}}^{(0)} = z_{\mathrm{clean}} + \sigma \varepsilon$
with $\varepsilon \sim \mathcal{N}(0, I)$ and a sampled noise scale $\sigma$.
- Iterative gradient descent refinement:
$z_{\mathrm{future}}^{(t+1)} = \Pi\left(z_{\mathrm{future}}^{(t)} - \alpha\, \nabla_{z_{\mathrm{future}}} E_\theta\left(z_{\mathrm{future}}^{(t)};\, z_{\mathrm{past}}, z_g, \tau\right)\right)$
where $\Pi$ projects to the encoder manifold and $\alpha$ is a learnable step size.
- Denoising loss:
$\mathcal{L} = \sum_{t=1}^{T} \mathrm{smooth\text{-}L_1}\left(z_{\mathrm{future}}^{(t)}, z_{\mathrm{clean}}\right)$
where $T$ is the number of refinement steps. This loss, with stop-gradient applied to $z_{\mathrm{future}}^{(t)}$ between refinement steps, aligns training with iterative refinement and eliminates the need for explicit negatives or partition-function estimates.
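The following sketch assembles these operations into a single training step. It is a minimal illustration under assumptions: it reuses the illustrative `TrajectoryEnergy` module above, fixes the cut time, samples the noise scale uniformly, and omits the projection $\Pi$; these choices are placeholders rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def pad_training_step(energy, encoder, states, H, T_refine, alpha, sigma_max=1.0):
    """One PaD-style training step on demonstrations `states` of shape (B, L, s_dim)."""
    B, L, _ = states.shape
    t_cut = L // 2                                        # fixed cut for simplicity (sampled in general)
    assert L - t_cut >= H, "demonstration must cover the planning horizon"

    z = encoder(states)                                   # (B, L, d) latent states
    z_past = z[:, :t_cut]
    z_clean = z[:, t_cut:t_cut + H]                       # clean future latents
    tau = torch.rand(B, 1, device=z.device)               # normalized time-to-reach
    goal_idx = t_cut + (tau.squeeze(1) * (H - 1)).long()  # hindsight goal at fraction tau of the horizon
    z_goal = z[torch.arange(B), goal_idx]

    sigma = sigma_max * torch.rand(B, 1, 1, device=z.device)
    z_fut = z_clean + sigma * torch.randn_like(z_clean)   # corrupted initialization z_future^(0)

    loss = 0.0
    for _ in range(T_refine):
        z_fut = z_fut.detach().requires_grad_(True)       # stop-gradient between refinement steps
        e = energy(z_fut, z_past, z_goal, tau).sum()
        grad, = torch.autograd.grad(e, z_fut, create_graph=True)
        z_fut = z_fut - alpha * grad                      # one descent step (projection Pi omitted)
        loss = loss + F.smooth_l1_loss(z_fut, z_clean)    # denoising loss on each iterate
    return loss
```

Gradients flow back through each refinement step via `create_graph=True`, while the `detach()` at the top of the loop realizes the stop-gradient between steps, so the energy is trained precisely for the descent procedure used at test time.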
3. Inference and Planning via Gradient-Based Latent Trajectory Descent
At test time, given observed past states and a goal:
- Batch initialization: Encode the observed past and the goal, and sample a batch of candidate future latents $z_{\mathrm{future}}^{(0)}$ and time-to-reach values $\tau$.
- Gradient-based refinement: Each candidate is refined with the same projected gradient-descent update used during training, descending the learned energy.
- Scoring and selection: Final energies are computed, and the top-$k$ candidates with lowest energy are retained. Sampling among these is biased toward lower $\tau$ (higher efficiency).
- Action execution: The chosen plan is passed through an inverse dynamics model to generate executable actions; after a fixed number of executed steps, the process replans from the current state.
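A minimal sketch of this inference loop is given below, again under assumptions: the candidate count, the top-$k$ selection rule, the $\tau$-biased softmax sampling, and the `inv_dyn` interface are illustrative rather than taken from the paper.

```python
import torch

@torch.no_grad()
def plan_with_descent(energy, encoder, inv_dyn, past_states, goal_state,
                      H, T_refine, alpha, n_candidates=64, k=8, exec_steps=4):
    """Sketch of PaD-style inference: refine a batch of candidate latent plans by descent."""
    z_past = encoder(past_states.unsqueeze(0)).expand(n_candidates, -1, -1)
    z_goal = encoder(goal_state.unsqueeze(0).unsqueeze(0)).squeeze(1).expand(n_candidates, -1)
    tau = torch.rand(n_candidates, 1)                      # candidate time-to-reach values
    z_fut = torch.randn(n_candidates, H, z_past.shape[-1]) # candidate initializations

    for _ in range(T_refine):
        with torch.enable_grad():
            z_req = z_fut.detach().requires_grad_(True)
            e = energy(z_req, z_past, z_goal, tau).sum()
            grad, = torch.autograd.grad(e, z_req)
        z_fut = (z_fut - alpha * grad).detach()            # descend the learned energy

    # Keep the k lowest-energy candidates, then bias selection toward small tau (faster plans).
    final_e = energy(z_fut, z_past, z_goal, tau)
    top = torch.topk(final_e, k, largest=False).indices
    weights = torch.softmax(-tau[top].squeeze(-1), dim=0)
    choice = top[torch.multinomial(weights, 1)]
    plan = z_fut[choice].squeeze(0)

    # Hypothetical inverse-dynamics interface: decode the first latent steps into actions.
    actions = inv_dyn(z_past[0, -1:], plan[:exec_steps])
    return plan, actions
```

Only the first few steps of the refined plan are executed before replanning from the current state, mirroring receding-horizon control.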
4. Experimental Evaluation on OGBench Cube Manipulation
PaD was evaluated on the OGBench “cube-single” manipulation tasks, considering two data regimes:
| Regime | Data Source | PaD Success | Baseline (GCIQL) Success | PaD Efficiency (steps) |
|---|---|---|---|---|
| Expert | cube-single-play-v0 (expert) | 95 ± 2% | 68 ± 6% | 78 ± 7 |
| Noisy | cube-single-noisy-v0 (noisy) | 98 ± 2% | 99 ± 1% | 63 ± 6 |
Efficiency is measured as the mean number of steps needed to solve the task in successful rollouts. Training on noisy, suboptimal data produced shorter test-time solutions than training on expert data, suggesting that broader coverage allows refinement toward more direct plans than those found in narrow expert demonstrations (García et al., 19 Dec 2025).
5. Comparative Analysis: Energy-Based Planning versus Policy Learning and Model Predictive Control
PaD’s energy-based approach mitigates the train–test mismatch typical of model-based and policy-based RL:
- Jointly learned verifier and planner: The energy gradients used for test-time planning are trained for denoising, ensuring that optimization behavior is supported by the energy landscape. There is no regime in which adversarial plans succeed under the model but fail in reality, as the planner is never trained to accept such solutions.
- No explicit model rollouts: By eschewing forward simulation with an imperfect dynamics model, PaD avoids the compounding of model errors and adversarial trajectory exploitation.
- Explicit goal conditioning and temporal bias: Hindsight relabeling with a “time-to-reach” variable focuses supervision on both goal attainment and plan efficiency, which is typically unavailable to generative or diffusion-based approaches.
Sample-based generative models (such as diffusion models or transformers) reproduce empirical trajectory distributions but lack explicit feasibility verification and often overfit predominant but suboptimal behavioral patterns in their training data. In contrast, PaD’s energy function is trained to verify and refine candidate solutions directly through descent, centering planning on the evaluation of goal-conditioned feasibility (García et al., 19 Dec 2025).
6. Connections to Other Planning-as-Descent Paradigms
Related approaches, such as Stein Task and Motion Planning (STAMP), generalize the “planning as descent” principle to hybrid discrete–continuous robotics problems using Stein Variational Gradient Descent in continuous relaxations of symbolic plans. While PaD operates by learning a latent energy landscape over full trajectories and using denoising-style objectives, STAMP employs gradient-based variational inference over both symbolic and continuous planning parameters, leveraging differentiable physics for plan refinement (Lee et al., 2023). Both frameworks unify planning and verification under gradient flow, but PaD’s core innovation is the joint learning and utilization of a goal-conditioned latent trajectory energy for direct, model-free verification-driven synthesis.
7. Implications and Position in the Planning Landscape
PaD occupies an intermediate conceptual role between direct policy learning and explicit search-based planning. It avoids the direct state-action mapping (with associated generalization errors) of policy learning, and circumvents the modeling and compounding error pitfalls of classical planners dependent on imperfect simulators. The verification-driven refinement process, regularized by denoising and hindsight, yields a robust, train–test aligned paradigm for offline, reward-free planning in complex, high-dimensional domains. This suggests a broader applicability of energy-based objectives and descent-based synthesis for controllable trajectory generation across reinforcement learning and robotics (García et al., 19 Dec 2025).