Goal-Conditioned Diffusion Policy

Updated 11 June 2026

A goal-conditioned diffusion policy is a generative architecture that uses denoising diffusion models to produce multi-step action sequences conditioned on explicit goals.
Goals are injected through direct input concatenation, prompt tokens, or feature-wise modulation, which lets the policy handle long-range dependencies and multi-modal behavior in robotic and control domains.
Reported evaluations show state-of-the-art performance in humanoid loco-manipulation, visual manipulation, and offline RL benchmarks, with accelerated samplers keeping inference efficient.

A goal-conditioned diffusion policy (DP) is a generative policy architecture that leverages denoising diffusion models to learn and sample multi-step sequences of actions or sub-trajectories conditioned explicitly on a goal specification. This framework has become central in both goal-conditioned imitation learning (GCIL) and offline/online goal-conditioned reinforcement learning (GCRL), due to its capacity to model multi-modal trajectory distributions, capture long-range dependencies, and generalize to diverse goal specifications across high-dimensional robotic and control domains. The formalism is rooted in Denoising Diffusion Probabilistic Models (DDPMs), with conditioning mechanisms tailored to inject observations and goals into both the forward (noising) and reverse (denoising) processes. Goal-conditioned DPs have achieved state-of-the-art results across a variety of domains, including whole-body humanoid loco-manipulation, multi-stage robotic manipulation, offline GCRL benchmarks, and vision-driven deformable object manipulation (Gu et al., 14 Mar 2026, Goswami et al., 7 Jun 2026, Kim et al., 2024, Bartsch et al., 2024).

1. Mathematical Formalism and Learning Objectives

At the core of a goal-conditioned DP lies the parameterization of the conditional policy as the reverse process of a Markovian forward noising chain. Let $A_t^0$ (or $a^0$) denote the ground-truth action sequence (or chunk), $S_t$ the observation, and $g$ the explicit task goal. The subscript $t$ indexes the environment time step and the superscript $k$ the diffusion step.

The forward process typically follows

$q(A_t^k | A_t^{k-1}) = \mathcal{N}\Bigl(\sqrt{\alpha_k}\,A_t^{k-1}, (1-\alpha_k)I\Bigr),$

or, equivalently,

$A_t^k = \sqrt{\bar\alpha_k}\,A_t^0 + \sqrt{1-\bar\alpha_k}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0, I),$

where $\bar{\alpha}_k = \prod_{i=1}^k \alpha_i$. The reverse (denoising) process learns $p_\theta(A_t^{k-1} | A_t^{k}, S_t, g)$, parameterized as a time- and goal-conditioned Gaussian.
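
The closed-form forward process above can be sketched in a few lines of plain Python. This is a minimal illustration, not any cited method's implementation: the linear beta schedule, its endpoints, the number of steps, and the helper names (`linear_alphas`, `cumulative_alphas`, `noise_actions`) are all assumptions made for the example.

```python
import math
import random

def linear_alphas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Per-step alpha_k = 1 - beta_k for a linear beta schedule (an assumed choice)."""
    span = max(num_steps - 1, 1)
    return [1.0 - (beta_start + (beta_end - beta_start) * i / span)
            for i in range(num_steps)]

def cumulative_alphas(alphas):
    """alpha_bar_k = prod_{i=1..k} alpha_i, returned for k = 1..K."""
    alpha_bars, running = [], 1.0
    for alpha in alphas:
        running *= alpha
        alpha_bars.append(running)
    return alpha_bars

def noise_actions(a0, k, alpha_bars, rng):
    """Sample A^k directly from A^0 via the closed form; k is 1-indexed."""
    ab = alpha_bars[k - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    ak = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(a0, eps)]
    return ak, eps

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_alphas(100))
a0 = [0.5, -0.2, 0.1, 0.0]            # a flattened toy action chunk
ak, eps = noise_actions(a0, 50, alpha_bars, rng)
```

Because $\bar\alpha_k$ decreases monotonically in $k$, the signal coefficient $\sqrt{\bar\alpha_k}$ shrinks and the noise coefficient grows as the chain advances, so any noise level can be sampled in one shot without iterating the chain.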

Training is performed via denoising score matching or maximum likelihood, using objectives such as

$L(\theta) = \mathbb{E}\left[ \|\epsilon - \epsilon_\theta(A_t^k, S_t, g, k)\|^2 \right],$

where $\epsilon_\theta$ is a neural network regressor predicting the noise at each diffusion step.
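
A single-sample Monte Carlo estimate of this objective might look as follows. It is a sketch under stated assumptions: the linear beta schedule, the uniform sampling of $k$, and the placeholder predictor `zero_model` (standing in for a trained $\epsilon_\theta$) are hypothetical and chosen only to keep the example self-contained.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_k for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_prediction_loss(eps_model, a0, obs, goal, alpha_bars, rng):
    """One-sample estimate of E||eps - eps_theta(A^k, S, g, k)||^2."""
    k = rng.randint(1, len(alpha_bars))        # uniform diffusion step, 1-indexed
    ab = alpha_bars[k - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    ak = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(a0, eps)]
    pred = eps_model(ak, obs, goal, k)         # network sees noisy actions, obs, goal, step
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

def zero_model(ak, obs, goal, k):
    """Hypothetical untrained predictor: always outputs zero noise."""
    return [0.0] * len(ak)

rng = random.Random(0)
alpha_bars = make_alpha_bars(100)
a0, obs, goal = [0.5, -0.2, 0.1, 0.0], [1.0, 0.0], [0.3, 0.7]
losses = [noise_prediction_loss(zero_model, a0, obs, goal, alpha_bars, rng)
          for _ in range(256)]
mean_loss = sum(losses) / len(losses)
```

For the zero predictor the loss reduces to $\|\epsilon\|^2$, whose expectation equals the action dimension, which gives a convenient sanity baseline that a trained network should beat.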

Some variants (e.g., Merlin) perform the diffusion in the state space rather than the action space, employing a single backward (denoising) step per environment step and reducing inference complexity (Jain et al., 2023). There exist both discrete-time (DDPM-based) and continuous-time (score-based SDE/ODE) instantiations (Reuss et al., 2023).

2. Goal Conditioning, Architectural Mechanisms, and Sampling

Goal conditioning is handled via multiple mechanisms:

Direct input concatenation: The goal $g$ (and possibly other contextual embeddings) is concatenated with the observation and/or injected as additional tokens for Transformer-based policies (Gu et al., 14 Mar 2026, Kim et al., 2024, Goswami et al., 7 Jun 2026).
Prompt tokens: Some architectures prepend learned goal embedding vectors as dedicated tokens (e.g., Sub-trajectory Stitching with Diffusion, SSD) (Kim et al., 2024).
Feature-wise modulation (FiLM): The conditioning vector modulates the intermediate activations in each layer (Bartsch et al., 2024).
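
The three mechanisms listed above differ mainly in where the goal enters the network. The sketch below shows each on toy vectors. All function names, dimensions, and weights are hypothetical, and the `1.0 + ...` parameterization of the FiLM scale (so that zero weights give the identity) is an assumed convention rather than something the cited works specify.

```python
def concat_condition(noisy_actions, obs, goal):
    """Direct input concatenation: one flat input vector [A^k ; S ; g]."""
    return list(noisy_actions) + list(obs) + list(goal)

def prepend_goal_token(action_tokens, goal_token):
    """Prompt-token conditioning: the goal embedding becomes the first sequence token."""
    return [list(goal_token)] + [list(tok) for tok in action_tokens]

def film(features, cond, w_gamma, w_beta):
    """FiLM: h' = gamma(cond) * h + beta(cond), applied per feature channel."""
    gamma = [1.0 + sum(w * c for w, c in zip(row, cond)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, cond)) for row in w_beta]
    return [g * h + b for h, g, b in zip(features, gamma, beta)]

# Toy inputs (hypothetical sizes: 2-D action, 3-D observation, 2-D goal).
noisy_actions, obs, goal = [0.3, -0.1], [1.0, 0.0, 0.5], [0.2, 0.8]

flat_input = concat_condition(noisy_actions, obs, goal)
tokens = prepend_goal_token([[0.3, -0.1], [0.0, 0.4]], goal)

hidden = [1.0, 2.0, 3.0]                              # one layer's activations
w_gamma = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]        # maps goal -> per-channel scale offset
w_beta = [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]         # maps goal -> per-channel shift
modulated = film(hidden, goal, w_gamma, w_beta)
```

Concatenation and prompt tokens change the input the backbone sees, whereas FiLM leaves the input untouched and instead rescales and shifts intermediate activations at every modulated layer.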

The neural backbone often comprises Transformer encoders or U-Net+Transformer hybrids, which capture long-range temporal dependencies and allow cross-attention from actions to the encoded observation/goal (Gu et al., 14 Mar 2026, Goswami et al., 7 Jun 2026). Some architectures (e.g., SculptDiff) instead use 1D CNNs for short-horizon action chunks within a goal-conditioned diffusion policy (Bartsch et al., 2024).

The sampling procedure typically follows the standard reverse-diffusion chain, repeatedly denoising a Gaussian-initialized action sequence with goal and state conditioning. Acceleration techniques such as DDIM and deterministic ODE sampling have been adopted for efficient inference, yielding action generation in far fewer denoising steps (e.g., BESO’s 3-step DDIM) compared to naive DDPM sampling (Reuss et al., 2023).
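
A deterministic DDIM-style sampler (the $\eta = 0$ case) over a short subsequence of diffusion steps can be sketched as below. This is an illustration under assumptions, not BESO's sampler: the linear beta schedule, the chosen step subsequence `[100, 50, 1]`, and the `make_oracle` noise predictor are hypothetical. The oracle returns the exact noise for a toy problem in which the only demonstrated action equals the goal, so the sampler's output can be checked against a known target.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_k for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def ddim_sample(eps_model, obs, goal, alpha_bars, step_indices, dim, rng):
    """Deterministic DDIM over a decreasing list of 1-indexed diffusion steps."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]          # Gaussian initialization
    for i, k in enumerate(step_indices):
        ab_k = alpha_bars[k - 1]
        is_last = i + 1 == len(step_indices)
        ab_prev = 1.0 if is_last else alpha_bars[step_indices[i + 1] - 1]
        eps_hat = eps_model(a, obs, goal, k)
        # Predicted clean action chunk implied by the current noise estimate.
        a0_hat = [(x - math.sqrt(1.0 - ab_k) * e) / math.sqrt(ab_k)
                  for x, e in zip(a, eps_hat)]
        # Jump to the next (less noisy) level, reusing the same noise estimate.
        a = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
             for x0, e in zip(a0_hat, eps_hat)]
    return a

def make_oracle(alpha_bars):
    def oracle_eps(a, obs, goal, k):
        """Toy stand-in for eps_theta: exact noise when the clean action is the goal."""
        ab = alpha_bars[k - 1]
        return [(x - math.sqrt(ab) * g) / math.sqrt(1.0 - ab) for x, g in zip(a, goal)]
    return oracle_eps

alpha_bars = make_alpha_bars(100)
goal = [0.4, -0.6]
sample = ddim_sample(make_oracle(alpha_bars), obs=[0.0], goal=goal,
                     alpha_bars=alpha_bars, step_indices=[100, 50, 1],
                     dim=len(goal), rng=random.Random(0))
```

The network is evaluated once per entry in `step_indices`, so three entries mean three forward passes instead of one per step of the full chain; with a learned, imperfect predictor, fewer steps generally trade some sample quality for speed.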

3. Hierarchical and Modular Integration

Goal-conditioned DPs are deployed in both flat and hierarchical policy architectures:

Hierarchical Control: DP generates high-level action commands (e.g., base velocity and Cartesian hand poses), which are executed by a separately parameterized low-level controller (typically an MLP-based or RL-trained policy) (Gu et al., 14 Mar 2026).
Model-predictive and subgoal-chaining: DPs are nested within higher-level world models or subgoal planners, receiving short-horizon subgoals as conditioning inputs and producing action sequences for each episode section (Goswami et al., 7 Jun 2026, Kim et al., 2024).
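
The division of labor in these architectures can be illustrated with a toy receding-horizon loop: a high-level planner emits a chunk of commands, a low-level controller tracks the first few, and the planner is queried again. Everything here is a hypothetical stand-in. `plan_commands` replaces a diffusion sampler with linear interpolation toward the goal, `track` replaces a learned controller with a proportional step, and the horizon, execution length, and gain are arbitrary.

```python
def plan_commands(state, goal, horizon):
    """Stand-in for the high-level DP: a chunk of intermediate targets toward the goal."""
    return [[s + (g - s) * (i + 1) / horizon for s, g in zip(state, goal)]
            for i in range(horizon)]

def track(state, command, gain=0.5):
    """Stand-in low-level controller: one proportional step toward the command."""
    return [s + gain * (c - s) for s, c in zip(state, command)]

def rollout(state, goal, horizon=8, execute=4, replans=10):
    """Plan a chunk, execute only its first `execute` commands, then replan."""
    trajectory = [list(state)]
    for _ in range(replans):
        chunk = plan_commands(state, goal, horizon)
        for command in chunk[:execute]:
            state = track(state, command)
            trajectory.append(list(state))
    return state, trajectory

start, goal = [0.0, 0.0], [1.0, -2.0]
final_state, trajectory = rollout(start, goal)
```

Executing only part of each chunk before replanning lets the planner correct for tracking error in the controller, which is the mismatch that joint or alternating training schemes aim to reduce.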

In hierarchical setups such as REFINE-DP, both the high-level diffusion planner and the low-level controller are updated via alternating optimization: DP parameters are fine-tuned using RL (e.g., PPO gradients through diffusion steps), while the controller is trained to accurately track the evolving distribution of high-level commands, thus minimizing distributional mismatch (Gu et al., 14 Mar 2026).

4. Application Domains and Empirical Evaluation

Goal-conditioned DPs have demonstrated high performance across a variety of robotic and control domains:

Humanoid Loco-Manipulation: REFINE-DP yields over $a^0$ 1 success rates in complex, long-horizon, out-of-distribution simulation and real-world tasks, substantially outperforming pre-trained baselines and flattened RL-only fine-tuning (Gu et al., 14 Mar 2026).
Multi-stage Visual Manipulation: In WorldDP, object-centric DPs track subgoals from a learned high-level world model, with horizon and conditioning tailored to the stage granularity. Empirical ablations indicate superior performance for horizon $a^0$ 2 and criticality of subgoal-conditioning (Goswami et al., 7 Jun 2026).
Offline Goal-Conditioned RL Benchmarks: SSD attains state-of-the-art scores on D4RL GCRL tasks, including Maze2D (Umaze: 144.6, Large: 183.5) and Fetch Reach/Push/PickAndPlace, with robust sub-trajectory stitching and long-horizon planning (Kim et al., 2024).
Real-World Deformable Object Manipulation: SculptDiff (point-cloud-conditioned DP) achieves best-in-class geometric proximity to target shapes in real-world clay sculpting, outstripping both model-based and imitation learning baselines (Bartsch et al., 2024).
General GCIL Benchmarks: BESO achieves highest rewards/success rates on Block-Push and Relay-Kitchen (Block-Push C-BESO: $a^0$ 3, CALVIN vision goal chaining) in only 3 inference steps (Reuss et al., 2023). Merlin, with its single-step denoising per environment step, is both computationally efficient and empirically superior or competitive with BC or RL baselines across 10 tasks (Jain et al., 2023).

5. Extensions, Ablations, and Practical Considerations

Key ablation findings and deployment observations include:

Network backbone: Transformer-based diffusion backbones outperform MLPs, both as pre-trained policies and after fine-tuning (Gu et al., 14 Mar 2026, Reuss et al., 2023).
Conditioning scheme: Absence of goal-conditioned inputs to the diffusion process severely impairs planning (WorldDP ablations) (Goswami et al., 7 Jun 2026).
Horizon trade-off: Too short a horizon impedes subgoal reachability, while too long a horizon leads to drift away from precise subgoal tracking; the optimal horizon is domain- and hierarchy-dependent (Goswami et al., 7 Jun 2026).
Sample efficiency and speed: Score-based DPs with efficient ODE solvers (BESO) or single-step denoising (Merlin) yield 10–15$\times$ inference speed-ups over full DDPM policies while retaining success rates (Reuss et al., 2023, Jain et al., 2023).
Multi-modality: DP backbones capture multiple diverse valid behaviors per goal, essential for tasks with many viable solution strategies (Gu et al., 14 Mar 2026, Reuss et al., 2023).

6. Limitations and Future Directions

Current limitations of goal-conditioned DPs include:

Perception and embodiment: Most published DPs are conditioned on state-based (pose or MoCap) observations; direct vision-conditioned DPs and policies grounded in raw sensory input remain an active research area (Gu et al., 14 Mar 2026, Bartsch et al., 2024).
Action/goal representation: Incorporation of latent skills or hierarchical abstractions holds promise for improving out-of-distribution generalization (Gu et al., 14 Mar 2026, Kim et al., 2024).
Temporal scaling: While sub-trajectory stitching (SSD) flexibly scales to very long-horizon tasks, slow sampling and over-reliance on the value estimator remain bottlenecks (Kim et al., 2024).
Force/interaction modeling: For deformable or contact-rich manipulation, lack of tactile/force feedback and non-physical gripper geometries limit expressivity (Bartsch et al., 2024).

Planned directions include fully vision-conditioned goal-embedding, online fine-tuning via limited (on-policy) rollouts, hierarchical DP stacks for compositional planning, and integration of uncertainty-aware value conditioning for robust sampling (Gu et al., 14 Mar 2026, Kim et al., 2024, Goswami et al., 7 Jun 2026).

References:

REFINE-DP: (Gu et al., 14 Mar 2026)
WorldDP: (Goswami et al., 7 Jun 2026)
SSD: (Kim et al., 2024)
SculptDiff: (Bartsch et al., 2024)
Merlin: (Jain et al., 2023)
BESO: (Reuss et al., 2023)