
Factored Subgoal Diffusion Model

Updated 4 February 2026
  • The paper introduces a novel generative planning framework that employs conditional diffusion to produce factored subgoals based on state, goal, and contextual conditioning.
  • It leverages both entity and temporal factorization with neural denoising networks to achieve adaptive subgoal resolution in reinforcement learning and robot manipulation tasks.
  • Empirical results show significant gains in long-horizon, multi-agent scenarios by decoupling high-level planning from low-level control through robust subgoal filtering and stitching.

A factored subgoal-generating conditional diffusion model is a class of generative planning frameworks that applies diffusion probabilistic modeling techniques to generate temporally and/or entity-factored subgoals, conditioned on the current state, final goal, and other relevant context. The resulting subgoals guide model-based or hierarchical control in challenging reinforcement learning and robot manipulation environments, particularly those characterized by long horizons, multiple entities, and sparse or delayed reward feedback. This modeling paradigm has been instantiated in various forms, including entity-factorized models for multi-agent or multi-object scenarios (Haramati et al., 2 Feb 2026), coarse-to-fine factorizations for adaptive subgoal resolution (Huang et al., 2024), and sub-trajectory stitching for offline goal-conditioned RL (Kim et al., 2024).

1. Mathematical Formulation and Factorization Principles

Factored subgoal-generating conditional diffusion models extend the diffusion modeling paradigm to structured planning. Given a planning problem with state space $\mathcal{S}$, goal space $\mathcal{G}$, and possibly an object/entity factorization $s = \{s^{(1)}, s^{(2)}, \dots, s^{(N)}\}$ or temporal factorization (sub-trajectories, multi-level subgoal chains), the aim is to model $p(\tilde{g} \mid s, g)$, where $\tilde{g}$ is a (potentially factored) immediate subgoal or subgoal sequence conditioned on the current state $s$ and desired goal $g$.

The diffusion process comprises:

  • A forward noising process that iteratively perturbs subgoals or sub-trajectory tokens according to

$$q(\tilde{g}_t^{(m)} \mid \tilde{g}_{t-1}^{(m)}) = \mathcal{N}\!\left(\tilde{g}_t^{(m)};\ \sqrt{1-\beta_t}\,\tilde{g}_{t-1}^{(m)},\ \beta_t I\right)$$

for each factor (entity $m$ or subgoal chain position).

  • A reverse (denoising) process, parameterized by a neural network (U-Net, Transformer, etc.), that reconstructs plausible subgoal candidates conditioned on contextual information. For example,

$$p_\theta(\tilde{g}_{t-1} \mid \tilde{g}_t, s, g) = \prod_{m=1}^N \mathcal{N}\!\left(\tilde{g}_{t-1}^{(m)};\ \mu_\theta^{(m)}(\tilde{g}_t, s, g, t),\ \beta_t I\right)$$

This design enables modeling of structured, high-dimensional goal-reaching tasks without explicitly modeling all joint interactions monolithically.
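The forward and reverse processes above can be sketched in a few lines of NumPy. The linear noise schedule, factor shapes, and the use of the true noise as a stand-in denoiser are illustrative assumptions, not values taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10                                 # diffusion steps
N, D = 3, 4                            # N entity factors, each a D-dim subgoal vector
betas = np.linspace(1e-4, 0.2, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_noise(g0, t):
    """Closed form of q(g_t | g_0), applied identically to every factor."""
    eps = rng.standard_normal(g0.shape)
    g_t = np.sqrt(alpha_bars[t]) * g0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return g_t, eps

def reverse_step(g_t, t, eps_hat):
    """One denoising step p(g_{t-1} | g_t), given a noise prediction eps_hat."""
    mu = (g_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mu                      # no noise is added at the final step
    return mu + np.sqrt(betas[t]) * rng.standard_normal(g_t.shape)

# Sanity check: noising at t = 0 and denoising with the true noise recovers g_0.
g0 = rng.standard_normal((N, D))
g1, eps = forward_noise(g0, 0)
recon = reverse_step(g1, 0, eps)
assert np.allclose(recon, g0, atol=1e-6)
```

In practice `eps_hat` comes from the conditional denoiser $\epsilon_\theta^{(m)}(\tilde{g}_t, s, g, t)$; here the true noise stands in so the round trip is exact.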

2. Conditioning Mechanisms and Network Architectures

Conditioning in these models is implemented by embedding contextual variables—current state, final goal, coarser subgoals, or scene latent variables—into vectors that are injected into the diffusion denoiser. Typical approaches include:

  • For entity-factored diffusion (Haramati et al., 2 Feb 2026): Each entity token is embedded with role (state, goal, noised subgoal), view, and timestep information, and concatenated before processing with multi-layer Transformers (e.g., 8 layers, 8 heads, hidden size 256).
  • For coarse-to-fine temporal subgoal chains (Huang et al., 2024): Current state, final goal, previous level’s coarser subgoals, and optional scene descriptors are mapped via MLPs/CNNs to a common embedding space and injected into a temporal U-Net via FiLM (feature-wise linear modulation).
  • For sub-trajectory factorizations (Kim et al., 2024): Goal and value encoding tokens are prepended to the tokenized sub-trajectory sequence and processed by a condition-prompted Transformer-based U-Net.
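As one concrete example of these injection schemes, FiLM conditioning reduces to a per-channel scale and shift computed from the condition embedding. The sketch below uses NumPy with arbitrary dimensions and random projection matrices standing in for learned layers:

```python
import numpy as np

def film(features, cond_embedding, W_gamma, W_beta):
    """FiLM: the condition embedding produces a per-channel scale (gamma)
    and shift (beta), broadcast over the temporal axis of the feature map."""
    gamma = cond_embedding @ W_gamma   # shape (C,)
    beta = cond_embedding @ W_beta     # shape (C,)
    return gamma * features + beta

rng = np.random.default_rng(0)
C, E, L = 8, 16, 5                      # channels, embedding dim, chain length
cond = rng.standard_normal(E)           # e.g. concatenated state/goal embeddings
feats = rng.standard_normal((L, C))     # denoiser features over a subgoal chain
out = film(feats, cond, rng.standard_normal((E, C)), rng.standard_normal((E, C)))
assert out.shape == (L, C)
```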

The denoising network predicts the injected noise for each factor, with the noise prediction loss summed across diffusion steps and factors:

$$L(\theta) = \mathbb{E}_{(s, g, \tilde{g}_0), t, \epsilon} \left[ \sum_{m=1}^N \left\| \epsilon - \epsilon_\theta^{(m)}(\tilde{g}_t, s, g, t) \right\|^2 \right]$$
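A minimal sketch of this objective, assuming the closed-form forward process and a trivial stand-in noise predictor in place of the learned network:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_loss(g0, eps_pred_fn, alpha_bars):
    """Simplified DDPM loss: noise all factors at a uniformly sampled step,
    predict the injected noise, and sum the squared error over factors."""
    T = len(alpha_bars)
    t = int(rng.integers(T))                          # uniform diffusion step
    eps = rng.standard_normal(g0.shape)               # injected Gaussian noise
    g_t = np.sqrt(alpha_bars[t]) * g0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = eps_pred_fn(g_t, t)                     # stand-in for eps_theta
    return float(np.sum((eps - eps_hat) ** 2))        # summed over factors m

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.2, 10))
g0 = rng.standard_normal((3, 4))                      # 3 entity factors
loss = denoising_loss(g0, lambda g_t, t: np.zeros_like(g_t), alpha_bars)
assert loss > 0.0
```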

3. Subgoal Generation, Factored Planning, and Adaptive Resolution

The factored subgoal generation process supports both spatial and temporal decomposition:

  • Entity-Factored Generation: In hierarchical entity-centric frameworks (Haramati et al., 2 Feb 2026), each entity’s subgoal $\tilde{g}^{(m)}$ is generated independently but with cross-entity attention, allowing the model to capture interdependencies while scaling to environments with many entities. At test time, $M$ diverse subgoal samples are drawn and filtered/selected via a value function to ensure reachability and maximal progress toward the global goal.
  • Coarse-to-Fine Temporal Decomposition: Subgoal Diffuser (Huang et al., 2024) recursively generates subgoal chains of increasing length and resolution. Given coarse subgoals, reachability estimators predict the minimum controller steps required between adjacent subgoals. If segments are too hard (exceeding an MPC horizon), the sequence is refined by inserting additional subgoals focused on challenging regions, as determined by upsampling and redistributing subgoal points in latent space.
  • Sub-Trajectory Stitching: In SSD (Kim et al., 2024), long trajectories are decomposed into overlapping $h$-step sub-trajectories. The diffusion model generates goal-conditioned sub-trajectories, which are stitched together recursively at test time to construct full plans.
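The adaptive-resolution idea in the coarse-to-fine scheme can be illustrated with a simple refinement loop. The midpoint insertion and distance-based reachability below are toy stand-ins for the learned reachability estimator and latent-space redistribution described above:

```python
import numpy as np

def refine_chain(subgoals, reachability, horizon):
    """Insert a midpoint subgoal wherever the estimated number of controller
    steps between adjacent subgoals exceeds the planning horizon."""
    refined = [subgoals[0]]
    for a, b in zip(subgoals[:-1], subgoals[1:]):
        if reachability(a, b) > horizon:
            refined.append((a + b) / 2.0)   # assumed: midpoint in latent space
        refined.append(b)
    return np.array(refined)

# Toy reachability: proportional to Euclidean distance between subgoals.
reach = lambda a, b: np.linalg.norm(b - a)
chain = np.array([[0.0], [1.0], [5.0]])
out = refine_chain(chain, reach, horizon=2.0)
# Only the hard segment [1, 5] receives a new subgoal at its midpoint, 3.
assert out.tolist() == [[0.0], [1.0], [3.0], [5.0]]
```

In the actual method the refinement would recurse until every segment fits within the controller horizon; one pass suffices here.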

These approaches allow adaptive determination of the number and locations of subgoals, yielding improved sample efficiency and planning performance, especially in problems where the low-level agent has limited reachability or where the environment exhibits multi-modal complexity.

4. Integration with Control and Reinforcement Learning Agents

The generated subgoals serve as intermediate objectives for low-level controllers or policies, bridging the gap between long-horizon planning and myopic control:

  • Hierarchical RL (HECRL; Haramati et al., 2 Feb 2026): The high-level planner generates subgoals via the factored subgoal diffusion model, filtered by a thresholded value function to ensure subgoals are within the “competence radius” of the low-level agent. The selected subgoal is then pursued by a goal-conditioned IQL (Implicit Q-Learning) policy for a fixed subgoal horizon $T_{sg}$.
  • Model Predictive Control (Subgoal Diffuser; Huang et al., 2024): Chains of adaptively sampled subgoals are fed to an MPPI (Model Predictive Path Integral) controller with a fixed horizon. Reachability estimators dynamically control the subgoal granularity. Replanning is triggered regularly to handle disturbances; if progress stalls, the subgoal chain is refined by increasing its resolution.
  • Offline Goal-Conditioned RL (SSD; Kim et al., 2024): The diffusion model proposes $h$-step plans from any state, recursively stitching these segments to reach the final goal. At each step, the policy samples a new sub-trajectory plan conditioned on the updated state and goal; actions are executed for $k \leq h$ steps before repeating the process.
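The sample-filter-select pattern shared by these hierarchies can be sketched as follows; the sampler, value function, threshold, and distance-based progress proxy are all placeholders for the diffusion model, learned value function, and task-specific metric:

```python
import numpy as np

def plan_step(state, goal, sample_subgoals, value_fn, M=64, threshold=0.5):
    """Sample M candidate subgoals, keep those whose value exceeds a threshold
    (a proxy for the low-level agent's competence radius), then select the
    candidate with the best estimated progress toward the final goal."""
    candidates = sample_subgoals(state, goal, M)
    values = np.array([value_fn(state, c) for c in candidates])
    feasible = candidates[values >= threshold]
    if len(feasible) == 0:                   # fall back to the highest-value sample
        return candidates[np.argmax(values)]
    progress = -np.linalg.norm(feasible - goal, axis=1)  # assumed progress proxy
    return feasible[np.argmax(progress)]

rng = np.random.default_rng(0)
sampler = lambda s, g, M: s + rng.standard_normal((M, len(s)))   # toy "diffusion"
value = lambda s, c: 1.0 / (1.0 + np.linalg.norm(c - s))         # nearer = reachable
sg = plan_step(np.zeros(2), np.ones(2) * 3.0, sampler, value)
assert sg.shape == (2,)
```

The selected `sg` would then be handed to the low-level policy or MPC controller for a fixed horizon before replanning.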

This decoupling of high-level planning and low-level execution enables modular algorithm design and facilitates generalization to domains with complex combinatorial structure.

5. Training Objectives, Hyperparameters, and Practical Details

Training adopts the denoising score matching (DSM) or simplified DDPM objective, typically with a uniform sampling of diffusion steps and injected Gaussian noise. Key architectural and training hyperparameters reported in the literature include:

| Paper | Steps $T$ | Network Architecture | Token Dim | Conditioning | Value Filtering/Selector |
|---|---|---|---|---|---|
| (Haramati et al., 2 Feb 2026) | 10 | 8-layer Transformer, 8 heads | 32–64 | Role, timestep, view | Value-threshold, progress |
| (Huang et al., 2024) | 8–12 | Temporal U-Net (4-level, FiLM) | 256 | State, goal, prior SDF | Reachability estimator |
| (Kim et al., 2024) | N/A | Prompted U-Net, transformer blocks | $d \sim 64$ | Goal, value | N/A (via sub-trajectory Q) |

Optimization typically uses Adam, batch sizes of 512, and up to $10^6$ updates. The subgoal sample size at inference is $M = 64$–$256$ for filtering-based selection.

6. Empirical Performance and Ablation Results

Factored subgoal-generating conditional diffusion models demonstrate significant improvements over prior non-factorized or non-adaptive planning methods in long-horizon, high-dimensional settings:

  • In multi-entity image-based tasks with sparse reward, HECRL achieves over 150% higher success rates than baseline GCRL agents on the hardest tasks and generalizes across increasing task horizons and entity counts (Haramati et al., 2 Feb 2026).
  • For long-horizon manipulation (rope reconfiguration, notebook manipulation), Subgoal Diffuser yields lower minimum final distance (e.g., $2.2 \pm 0.9$ vs. $7.6 \pm 1.7$ on rope) compared to both prior diffusion-policy and decision-diffuser baselines (Huang et al., 2024).
  • Ablations confirm that adaptive subgoal resolution and coarse-to-fine redistribution are critical, yielding gains of up to 30% over fixed-resolution ablations. On the notebook task, omitting coarse-to-fine modeling degrades performance from $1.6 \pm 2.4$ to $9.5 \pm 9.6$.
  • In SSD, generated sub-trajectories enable successful stitching of suboptimal segments from offline datasets, resulting in state-of-the-art performance on standard GCRL benchmarks (Kim et al., 2024).

A plausible implication is that the adaptive, factored architecture is especially important in environments where single-step or monolithic plan generation is infeasible due to combinatorial entity structure or highly non-uniform task difficulty along the trajectory.

7. Connections, Significance, and Current Limitations

Factored subgoal-generating conditional diffusion models represent a converging point between generative temporal modeling, hierarchical and entity-centric reinforcement learning, and large-scale offline planning. By leveraging the denoising diffusion paradigm, they afford robust, multi-modal, and distributionally calibrated subgoal proposals, supporting modular decoupling between high-level planning and low-level control.

The fundamental strengths are:

  • Scalability in multi-entity or high-dimensional state spaces via factorization
  • Flexibility in supporting adaptive granularity of plans (coarse when possible, fine where needed)
  • Modular integration with standard RL or MPC agents via value- or reachability-based filtering

Identified limitations include:

  • The need for reliable value/reachability estimators to ensure selected subgoals are feasible for the low-level agent, which constrains the sample efficiency and practical applicability.
  • The complexity of multi-level architectures and associated computational costs.

Current research advances entity- and temporally factored subgoal diffusion, but open challenges remain in further scaling, compositionality, and transfer across diverse domains (Haramati et al., 2 Feb 2026; Huang et al., 2024; Kim et al., 2024).
