Unified Diffusion Planners in Sequential Decision Making

Updated 7 January 2026
  • Unified Diffusion Planners are a class of generative models that use denoising diffusion processes to parameterize and generate entire trajectories, unifying multi-task decision-making.
  • They integrate hierarchical and multiscale architectures with guided sampling techniques to enhance performance in offline RL, motion synthesis, and autonomous driving.
  • Empirical evaluations demonstrate significant improvements in normalized rewards and success rates compared to traditional methods, highlighting their scalability and robustness.

Unified Diffusion Planners are a class of algorithmic frameworks for sequential decision-making and trajectory generation, leveraging denoising diffusion probabilistic models (DDPMs) to unify high-dimensional, long-horizon, and often multi-task planning under one coherent generative modeling paradigm. These methods subsume and extend classical behavioral cloning, model-based control, and reinforcement learning (RL) by parameterizing entire trajectories (or temporally abstract plans) as samples from a flexible, tractable diffusion process, often equipped with sophisticated hierarchical structures, guided sampling, and adaptive inference-time control. Unified Diffusion Planners have established state-of-the-art performance and scalability for offline RL, goal-conditioned planning, compositional generalization, motion synthesis, and autonomous driving.

1. Mathematical Foundations and Core Principles

Unified Diffusion Planners formulate the sequential decision process as inference in a trajectory-level generative model. Given a Markov Decision Process (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, horizon $H$, and trajectory $\tau = ((s_0, a_0), \ldots, (s_H, a_H))$, the planning objective is to generate high-return trajectories according to $J(\tau) = \sum_{k=0}^{H} r(s_k, a_k)$, where $r$ is the reward function (Hao et al., 12 May 2025). The diffusion process is constructed as a Markov chain in trajectory space:

  • Forward ("noising") process: At each diffusion step $t$,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big),$$

where $x_t$ denotes the noisy trajectory at diffusion step $t$ and $\beta_t$ is a fixed noise schedule (Lu et al., 1 Mar 2025).

  • Reverse ("denoising") process: A neural network parameterizes

$$p_{\theta}(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),$$

where typically $\mu_\theta$ is trained to predict the noise injected in the forward process (score matching or $\epsilon$-prediction loss) (Hao et al., 12 May 2025).

Inference (planning) is performed by sampling from $p_\theta(x_0)$, starting from $x_T \sim \mathcal{N}(0, I)$. The model can be conditioned or guided to generate reward-maximizing or goal-achieving trajectories via classifier guidance, value-function gradients, or other auxiliary objectives.
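As a concrete illustration of the training target and the reverse-time sampling loop described above, the following is a minimal, self-contained sketch in PyTorch. It assumes a hypothetical denoiser `eps_model(x_t, t)` operating on trajectory tensors of shape (batch, horizon, state-action dim); it is not code from the cited papers.

```python
# Minimal sketch (not from the cited papers): DDPM-style trajectory denoising with
# an epsilon-prediction objective, assuming a hypothetical `eps_model(x_t, t)` network.
import torch

T = 100                                    # number of diffusion steps
betas = torch.linspace(1e-4, 2e-2, T)      # fixed noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{s <= t} alpha_s

def training_loss(eps_model, x0):
    """Epsilon-prediction loss on a batch of clean trajectories x0 of shape (B, H, D)."""
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))                    # random diffusion step per sample
    eps = torch.randn_like(x0)                       # injected Gaussian noise
    ab = alpha_bars[t].view(B, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps     # forward ("noising") process sample
    return ((eps_model(x_t, t) - eps) ** 2).mean()   # predict the injected noise

@torch.no_grad()
def sample(eps_model, shape):
    """Reverse ("denoising") process: start from pure noise x_T and iterate down to x_0."""
    x = torch.randn(shape)                           # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, torch.full((shape[0],), t, dtype=torch.long))
        mu = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        x = mu + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mu
    return x                                         # sampled trajectory x_0
```

Guided sampling (Section 3) modifies the mean computed inside this loop before the noise is added.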

2. Hierarchical and Multiscale Architectures

Unified Diffusion Planners increasingly adopt hierarchical decompositions to address long-horizon, high-dimensional planning, mitigating issues such as error accumulation and limited receptive field. A prototypical hierarchical planner includes:

  • High-Level (HL) Diffusion: Samples coarse subgoals $g = \{g_i\}$ (e.g., waypoints or temporally abstract subgoals) with a dedicated diffusion process,
  • Low-Level (LL) Diffusion: For each segment, a separate or factored diffusion model generates fine-grained trajectory segments $x_{1:N}$ conditioned on the corresponding subgoals (Hao et al., 12 May 2025, Chen et al., 2024, Chen et al., 25 Mar 2025).

CHD (Coupled Hierarchical Diffusion) models HL subgoals and LL trajectories jointly in a unified diffusion process, passing LL feedback upstream via a shared classifier for segment-wise coherence (Hao et al., 12 May 2025). The Hierarchical Multiscale Diffuser (HM-Diffuser) recursively reuses the same diffusion model across levels, enabling scalable, extendable long-horizon planning through progressive trajectory extension (PTE), where short segments are iteratively stitched into long ones and the model is conditioned on a level index specifying temporal resolution (Chen et al., 25 Mar 2025).
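A minimal sketch of this generic two-level decomposition is given below. It assumes hypothetical `hl_sampler` and `ll_sampler` callables that wrap trained high- and low-level diffusion models and return lists of states; it is not the CHD or HM-Diffuser implementation.

```python
# Minimal sketch (assumed interfaces): a generic hierarchical planner in which a
# high-level diffusion model proposes subgoals and a low-level diffusion model
# fills in the trajectory segments between consecutive subgoals.
def hierarchical_plan(hl_sampler, ll_sampler, start_state, goal_state, n_subgoals):
    """hl_sampler / ll_sampler are assumed callables wrapping trained diffusion models."""
    # High level: sample a coarse sequence of subgoals (waypoints) in one shot.
    subgoals = hl_sampler(start=start_state, goal=goal_state, n=n_subgoals)

    # Low level: generate a fine-grained segment for each consecutive waypoint pair;
    # each segment is conditioned on its endpoints so segments stitch directly.
    waypoints = [start_state, *subgoals, goal_state]
    segments = [
        ll_sampler(start=waypoints[i], goal=waypoints[i + 1])
        for i in range(len(waypoints) - 1)
    ]

    # Concatenate segments into a full plan, dropping duplicated endpoints.
    plan = list(segments[0])
    for seg in segments[1:]:
        plan.extend(seg[1:])
    return plan
```

CHD differs from this factored sketch by coupling the two levels in a single diffusion process with classifier feedback, and HM-Diffuser reuses one model across levels.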

The following table summarizes key hierarchical diffusion planning models:

Model | HL–LL Decoupling | Multiscale Recursion | Trajectory Extension | Key Innovation
----- | ---------------- | -------------------- | -------------------- | --------------
CHD (Hao et al., 12 May 2025) | Joint (coupled) | No | No | Classifier-guided HL–LL coupling
HD (Chen et al., 2024) | Separate (parallel) | No | No | Efficient jumpy planning
HMD (Chen et al., 25 Mar 2025) | Recursive (single model) | Yes | Yes (PTE) | Unified multiscale model
SCoTS+HD (Lee et al., 1 Jun 2025) | Separate | No | Yes | Augmentation via stitching

3. Guided Sampling, Trajectory Stitching, and Task Adaptation

Guidance mechanisms are central to Unified Diffusion Planners:

  • Classifier Guidance/Reward Gradients: At every denoising step, the Gaussian mean is shifted by the gradient of a classifier or reward estimator, e.g.,

$$\tilde{\mu} = \mu_\theta(x_t, t) + \lambda\, \Sigma_\theta \nabla_{x_t} \log C_\phi(x_t),$$

(Hao et al., 12 May 2025, Liang et al., 2023, Zheng et al., 26 Jan 2025). This steers generation toward high-reward or task-desirable samples.

  • Monte Carlo Sample Selection (MCSS): Drawing $N$ independent unconditional samples and selecting the highest-critic-scoring trajectory can outperform guided sampling when expert behavior is sufficiently represented in the dataset (Lu et al., 1 Mar 2025); both the guided mean shift and MCSS are sketched after this list.
  • Prior Guidance (PG): Replaces the standard Gaussian prior over initial noise with a learnable distribution optimized to generate higher-value trajectories, yielding efficient one-sample inference (2505.10881).
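The sketch below illustrates a single classifier-guided denoising update and the MCSS alternative. Here `mu` and `var` denote the Gaussian mean and (diagonal) variance at the current denoising step, and `log_classifier`, `sample_fn`, and `critic` are hypothetical callables; none of this is API code from the cited papers.

```python
# Minimal sketch (assumed interfaces): classifier-guided mean shift at one denoising
# step, and Monte Carlo sample selection (MCSS) over N unconditional samples.
import torch

def guided_step(mu, var, x_t, log_classifier, guidance_scale=1.0):
    """Shift the Gaussian mean by the classifier/reward gradient evaluated at x_t."""
    x_t = x_t.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_classifier(x_t).sum(), x_t)[0]
    mu_tilde = mu + guidance_scale * var * grad          # mu + lambda * Sigma * grad log C(x_t)
    return mu_tilde + var.sqrt() * torch.randn_like(mu)  # draw x_{t-1} from the shifted Gaussian

def mcss_select(sample_fn, critic, n_samples=50):
    """Draw N unconditional trajectories and keep the one the critic scores highest."""
    candidates = [sample_fn() for _ in range(n_samples)]
    scores = torch.stack([critic(traj) for traj in candidates])
    return candidates[int(scores.argmax())]
```

In practice the guidance scale trades reward-seeking against staying on the data manifold, which is one reason MCSS can be preferable when expert behavior is well covered.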

Trajectory extension and augmentation address data limitations and long-horizon generalization:

  • State-Covering Trajectory Stitching (SCoTS): Builds long, diverse trajectories by learning a temporal distance–preserving latent space and incrementally stitching offline segments to maximize latent coverage, subsequently training diffusion planners on the augmented dataset (Lee et al., 1 Jun 2025).
  • Progressive Trajectory Extension (PTE): Iterative bridging of short segments with learned diffusion-based "stitchers," enabling planning for horizons vastly exceeding the original dataset's length (Chen et al., 25 Mar 2025).
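A simplified sketch of the extension loop is shown below; `stitcher` is an assumed callable that samples a bridging segment (as a list of states) starting from a given state. This is an illustration of the general idea rather than the published PTE algorithm.

```python
# Minimal sketch (assumed interfaces): progressively extend a short trajectory by
# repeatedly sampling a bridging segment with a diffusion-based "stitcher"
# conditioned on the current endpoint.
def progressive_extension(stitcher, trajectory, n_rounds):
    trajectory = list(trajectory)
    for _ in range(n_rounds):
        segment = stitcher(start=trajectory[-1])  # bridge onward from the current endpoint
        trajectory.extend(segment[1:])            # append, dropping the duplicated endpoint
    return trajectory
```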

AdaptDiffuser closes the "generation-guidance-adaptation" loop: diffusion sampling is reward-gradient guided, refined by a discriminator that rejects poor trajectories, and the model is iteratively fine-tuned on accepted synthetic rollouts, improving adaptation and generalization to novel tasks (Liang et al., 2023).
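In pseudocode, the loop looks roughly as follows, with `guided_sample`, `discriminator`, and `fine_tune` as assumed placeholders rather than AdaptDiffuser's actual interfaces:

```python
# Minimal sketch (assumed interfaces) of a generation-guidance-adaptation loop in the
# spirit of AdaptDiffuser: guided sampling, discriminator filtering, and fine-tuning
# on the accepted synthetic trajectories.
def self_improvement_loop(model, guided_sample, discriminator, fine_tune,
                          n_rounds=5, n_samples=256):
    for _ in range(n_rounds):
        # 1. Generate candidate trajectories with reward-gradient-guided sampling.
        candidates = [guided_sample(model) for _ in range(n_samples)]
        # 2. Keep only trajectories the discriminator judges feasible / high quality.
        accepted = [traj for traj in candidates if discriminator(traj)]
        # 3. Fine-tune the diffusion model on the accepted synthetic rollouts.
        if accepted:
            model = fine_tune(model, accepted)
    return model
```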

4. Extensions: Variable Horizon, Latent Actions, Multi-Task, and Control Integration

Unified Diffusion Planners have been extended on several orthogonal axes:

  • Variable Horizon Control: VH-Diffuser introduces a Length Predictor trained with hybrid temporal/geometric supervision, controlling the sampling process to adaptively output instance-specific trajectory lengths. Training on random-length sub-trajectories allows robust generation across unseen horizon requirements, without backbone architectural change (Liu et al., 15 Sep 2025); a generic sketch of this predict-then-sample pattern follows this list.
  • Latent Action Space Planning: LatentDiffuser replaces the action space with continuous skills or latent actions discovered via a VAE. Planning leverages energy-guided diffusion sampling, theoretically establishing equivalence to KL-regularized RL objectives. An auxiliary network enables exact guidance at each marginal reverse step (Li, 2023).
  • Unified Planner-Controller: UniPhys merges planning and physics-based control by training a single diffusion model to denoise noisy motion histories and handle simulation drift, robustly generating long-horizon, physically plausible behaviors from text/goal/trajectory input ("Diffusion Forcing paradigm") (Wu et al., 17 Apr 2025).
  • Multi-Task and Task Adaptation: SODP adopts task-agnostic pre-training (on multi-task, sub-optimal demonstration data) followed by RL-based reward-guided fine-tuning, using importance-weighted gradients and behavior-cloning regularization to avoid catastrophic forgetting and rapidly adapt to new downstream tasks (Fan et al., 2024).
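The variable-horizon extension from the first bullet reduces to a simple predict-then-sample pattern; the sketch below uses hypothetical `length_predictor` and `planner` callables and is not the VH-Diffuser implementation.

```python
# Minimal sketch (assumed interfaces): a length predictor chooses an instance-specific
# horizon and the planner then samples a trajectory of exactly that length.
def variable_horizon_plan(length_predictor, planner, start_state, goal_state,
                          max_horizon=1000):
    horizon = min(int(length_predictor(start_state, goal_state)), max_horizon)
    return planner(start=start_state, goal=goal_state, horizon=horizon)
```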

5. Empirical Evaluations and Comparative Performance

Unified Diffusion Planners have demonstrated superior empirical performance across diverse benchmarks:

  • Long-Horizon Navigation (Maze2D, AntMaze): CHD achieves ∼157.1 normalized reward, outperforming flat and prior hierarchical diffusion baselines (+17.2 points, ∼12%) (Hao et al., 12 May 2025). SCoTS+HD achieves average success rates ∼96.8% in goal-conditioned stitching and exploration settings, outscoring alternate planners and classic RL (Lee et al., 1 Jun 2025).
  • Manipulation and Robotics: SODP surpasses prior multi-task baselines on Meta-World MT50-rand (60.6% vs. Prompt-DT 48.4%) and Adroit (73.9% vs. DiffusionPolicy 31.7%) (Fan et al., 2024). AdaptDiffuser demonstrates +20.8% improvement in Maze2D and notable gains in zero-shot KUKA pick-and-place (Liang et al., 2023).
  • Autonomous Driving: Diffusion Planner achieves closed-loop scores 89.19–92.08 (non-reactive mode) on nuPlan and real-world driving datasets, matching or surpassing learning- and rule-based baselines, with classifier guidance further reducing collision rates (Zheng et al., 26 Jan 2025).
  • Motion Synthesis: UniPhys attains high motion naturalness and success rates without task-specific fine-tuning (e.g., text-driven naturalness 3.23 vs. next-best 2.86) (Wu et al., 17 Apr 2025).

A summary table of selected metric improvements:

Task/Domain | Baseline | Unified Diffusion Variant | Performance Metric | Relative Gain
----------- | -------- | ------------------------- | ------------------ | -------------
Maze2D U-Maze | Flat Diffuser | CHD (Hao et al., 12 May 2025) | Norm. reward 119.5 → 157.1 | +31.4%
Maze2D Stitch+Explore | HD | SCoTS+HD (Lee et al., 1 Jun 2025) | Avg. success 25.4% → 96.8% | ∼4×
Meta-World MT50-rand | DiffPoints | SODP (Fan et al., 2024) | Success 48.7% → 60.6% | +12%
AntMaze Giant | Diffuser | HMD-X (Chen et al., 25 Mar 2025) | Score 0 → 82.1 | --
nuPlan (autonomous driving) | PlanTF | Diffusion Planner (Zheng et al., 26 Jan 2025) | Score 85.62 → 89.19 | +4.2%

6. Limitations and Future Directions

Unified Diffusion Planners face several open challenges:

  • Manual Segmentation & Granularity: CHD and related hierarchical planners rely on hand-tuned segmentation (fixed $N$, $h$), with uniform segment lengths that may not match variable temporal granularity in complex tasks (Hao et al., 12 May 2025). Adaptive or end-to-end subgoal discovery remains unsolved.
  • Coverage & Data Diversity: Planner performance is tightly coupled to the coverage of the offline data. Methods like SCoTS, PTE, and reward-guided fine-tuning (SODP) address this, but they introduce their own reliability concerns when the synthesized trajectories go beyond what is actually feasible in the environment (Lee et al., 1 Jun 2025, Fan et al., 2024).
  • Out-of-Distribution Robustness & Classifier Reliability: Classifier and value-guided diffusion can misdirect sampling into distributional drift, particularly in highly multi-modal or poorly covered tasks (Hao et al., 12 May 2025, 2505.10881). Unconditional sampling with selection (MCSS) may be preferred when expert data is abundant (Lu et al., 1 Mar 2025).
  • Computational Cost & Scalability: Tree-based search variants (MCTD (Yoon et al., 11 Feb 2025)) and efficient hierarchical parallelization reduce wall-clock and memory costs, but large-scale deployment in real-time, interactive systems remains a challenge.

Future work is focused on adaptive segmentation, integrating perception for end-to-end planning from raw observations, safety and constraint-aware guidance (e.g., collision/kinematics constraints), learning latent/hierarchical structures in a fully data-driven way, and further unifying planning and world-model-based policy learning (Hao et al., 12 May 2025, Chen et al., 25 Mar 2025, Wu et al., 17 Apr 2025).

7. Variants, Design Choices, and Best Practices

Comprehensive empirical studies highlight several critical design trade-offs:

  • Backbone architecture: Transformer denoisers (DiT1D) surpass U-Nets in planning tasks, yielding higher normalized return with fewer parameters and faster inference (Lu et al., 1 Mar 2025).
  • Guided sampling vs. selection: In expert-rich datasets, unconditional sampling with critic-based selection (MCSS) achieves or exceeds the performance of classifier-guided approaches, avoiding guidance bias and parameter tuning (Lu et al., 1 Mar 2025).
  • Action parameterization: Planning in state space followed by inverse-dynamics mapping to actions yields robust performance in high-dimensional environments (Lu et al., 1 Mar 2025); a minimal sketch of this mapping follows this list.
  • Planning stride and parallelization: Jumpy (coarse) strides and efficient parallelization at the hierarchy's LL facilitate larger receptive fields and faster generation (Chen et al., 2024, Chen et al., 25 Mar 2025).
  • Planning horizon: Variable-horizon training (VHD) achieves high robustness to instance-specific task complexity, reducing both failures due to mismatched length and unnecessary detours (Liu et al., 15 Sep 2025).
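To make the state-space planning with inverse-dynamics action recovery concrete (see the action-parameterization bullet above), the following minimal sketch maps a planned state sequence to actions; `inverse_dynamics` is an assumed learned model $a_k = f(s_k, s_{k+1})$, not an API from the cited work.

```python
# Minimal sketch (assumed interfaces): recover actions from a state-space plan with a
# learned inverse-dynamics model a_k = f(s_k, s_{k+1}).
def extract_actions(state_plan, inverse_dynamics):
    """state_plan: planned states [s_0, ..., s_H]; inverse_dynamics: assumed callable."""
    return [
        inverse_dynamics(state_plan[k], state_plan[k + 1])  # action taking s_k to s_{k+1}
        for k in range(len(state_plan) - 1)
    ]
```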

Through these innovations, Unified Diffusion Planners represent the convergence of hierarchical modeling, probabilistic denoising, adaptive inference, and guided learning, establishing a new paradigm for scalable, flexible, and generalizable sequential decision making across robotics, RL, motion planning, and autonomous control domains.
