Multi-Seed Dynamics-Aware Diffusion Policy

Updated 20 November 2025
  • The paper introduces a novel multi-seed diffusion framework that models diverse behavioral modes in O2O-RL, yielding significant improvements in locomotion and manipulation tasks.
  • It utilizes a U-Net backbone with Transformer-based state conditioning to integrate state trajectories, enabling distinct sub-policies from independent Gaussian noise seeds.
  • Diversity is enforced via dynamics-level KL regularization, ensuring physically meaningful action variations and robust generalization across different real-world scenarios.

The multi-seed dynamics-aware diffusion policy is a generative policy architecture and training strategy designed to address the challenges of multimodal behavior representation and distributional robustness in offline-to-online reinforcement learning (O2O-RL). It enables a single diffusion network to efficiently model a diverse ensemble of sub-policies, each corresponding to a distinct behavioral mode, through the use of multiple diffusion noise seeds. This framework incorporates explicit dynamics-level diversity regularization to ensure that the resulting action sequences (policies) are not only diverse but also physically meaningful, supporting enhanced generalization and applicability in robotic learning regimes (Huang et al., 13 Nov 2025).

1. Policy Network Architecture

The core architecture consists of a U-Net backbone augmented with Transformer-based cross-attention for state conditioning. The following components are central:

  • U-Net Backbone: Processes a noisy action sequence $x_t \in \mathbb{R}^{T \times d_a}$ at each diffusion step $t$ and produces a noise estimate $\epsilon_\theta(x_t, t, s_{1:T})$.
  • State Conditioning via Transformer: The state sequence $s_{1:T}$ is embedded and used as keys and values in a Transformer. The query is derived from U-Net bottleneck features, allowing the policy to condition action sampling on the entire state trajectory.
  • Multi-Seed Ensemble: Rather than training multiple networks, multiple action sequence samples are generated during inference by initializing the reverse diffusion process with independent Gaussian noise seeds $\epsilon_i \sim \mathcal{N}(0, I)$. Each seed $\epsilon_i$ gives rise to a distinct sub-policy $\pi_\theta^i$, reflecting a different behavior mode present in the underlying dataset.
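
A minimal PyTorch sketch of this conditioning pattern is shown below. It is illustrative only: the single-level 1-D U-Net, layer counts, and widths are assumptions rather than the paper's exact architecture, but it reproduces the stated interface, with U-Net bottleneck features acting as cross-attention queries over the encoded state trajectory.

```python
# Illustrative sketch: state-conditioned noise predictor eps_theta(x_t, t, s_{1:T}).
import torch
import torch.nn as nn

class StateConditionedDenoiser(nn.Module):
    def __init__(self, action_dim, state_dim, hidden=256, n_heads=8, max_steps=1000):
        super().__init__()
        # Shallow 1-D "U-Net" over the action sequence (down/up projections only).
        self.down = nn.Conv1d(action_dim, hidden, kernel_size=3, padding=1)
        self.up = nn.Conv1d(hidden, action_dim, kernel_size=3, padding=1)
        # Diffusion-step embedding added to the bottleneck features.
        self.t_embed = nn.Embedding(max_steps, hidden)
        # Transformer encoder over the state trajectory; outputs serve as keys/values.
        self.state_proj = nn.Linear(state_dim, hidden)
        self.state_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True), num_layers=2)
        # Cross-attention: queries from the U-Net bottleneck, keys/values from states.
        self.cross_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)

    def forward(self, x_t, t, states):
        # x_t: (B, T, action_dim) noisy actions; t: (B,) step indices; states: (B, T, state_dim)
        h = self.down(x_t.transpose(1, 2)).transpose(1, 2)   # (B, T, hidden)
        h = h + self.t_embed(t).unsqueeze(1)                 # inject diffusion step
        kv = self.state_encoder(self.state_proj(states))     # encode s_{1:T}
        h, _ = self.cross_attn(query=h, key=kv, value=kv)    # condition on state trajectory
        return self.up(h.transpose(1, 2)).transpose(1, 2)    # predicted noise, (B, T, action_dim)

# Example: a batch of two 16-step action sequences with 7-dim actions and 17-dim states.
model = StateConditionedDenoiser(action_dim=7, state_dim=17)
eps_hat = model(torch.randn(2, 16, 7), torch.randint(0, 1000, (2,)), torch.randn(2, 16, 17))
```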

2. Diffusion Process: Forward and Reverse Dynamics

The policy exploits the Denoising Diffusion Probabilistic Model (DDPM) framework, adapting it to action sequence generation:

  • Forward Process: For timesteps $t = 1, \dots, T$, the noisy action sequence evolves according to

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

with $\alpha_t = 1 - \beta_t$ and cumulative noise coefficient $\bar\alpha_t = \prod_{i=1}^t (1 - \beta_i)$.

  • Reverse (Denoising) Process: The parameterized model reconstructs denoised samples via

$$p_\theta(x_{t-1} \mid x_t, s_{1:T}) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, s_{1:T}),\ \Sigma_t\big)$$

where

$$\mu_\theta(x_t, t, s_{1:T}) = \frac{1}{\sqrt{\alpha_t}}\left[ x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t, s_{1:T}) \right]$$

and in practice $\Sigma_t = \beta_t I$.

During inference, each seed initializes $x_T = \epsilon_i$, and the denoising process is executed independently for each sub-policy.
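
The equations above map directly onto a few lines of code. The following sketch assumes any noise predictor with the signature $\epsilon_\theta(x_t, t, s_{1:T})$ (for example the `StateConditionedDenoiser` sketched in Section 1); the linear $\beta$ schedule and its endpoints are illustrative assumptions.

```python
# Sketch of DDPM forward noising and a single reverse (denoising) step, with Sigma_t = beta_t I.
import torch

T_STEPS = 100
betas = torch.linspace(1e-4, 2e-2, T_STEPS)        # {beta_t}: assumed linear schedule
alphas = 1.0 - betas                               # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)          # cumulative \bar{alpha}_t

def forward_noise(x0, t, eps):
    """q(x_t | x_0): closed form of the forward process for a batch of timesteps t."""
    ab = alpha_bars[t].view(-1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

def reverse_step(model, x_t, t, states):
    """One step of p_theta(x_{t-1} | x_t, s_{1:T}); t is a zero-indexed Python int."""
    t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
    eps_hat = model(x_t, t_batch, states)
    mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    if t == 0:
        return mean                                # no noise added at the final step
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```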

3. Training Objectives and Diversity Regularization

Training jointly optimizes two primary objectives:

  • Standard DDPM Score-Matching Loss:

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[ \big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t,\ s_{1:T}\big)\big\|_2^2 \Big]$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $t$ is uniformly sampled, following the canonical DDPM formulation.

  • Sequence-Level KL Regularization (Ensemble Spread): To enforce global diversity among the sub-policies, a sequence-level KL regularization term is introduced post-training (fine-tuning or re-weighting). For each sub-policy $\pi^i_\theta$,

$$J(\pi^i_\theta) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log p^i_\theta(a \mid s)\big] + \alpha\, \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\log\frac{p^i_\theta(a \mid s)}{\max_{j<i}\, p^j_\theta(a \mid s)}\right]$$

The final training objective is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} - \sum_{i=1}^n J(\pi^i_\theta)$$

where the log-likelihood term promotes accurate modeling while the $\alpha$-weighted term encourages the spread of the ensemble. The hyperparameter $\alpha$ typically ranges from 0.1 to 1.0.
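
A sketch of the score-matching term for one batch is given below, reusing the schedule variables from the previous sketch; the diversity term $J(\pi^i_\theta)$ is omitted because it requires per-sub-policy likelihoods (or a surrogate for them), which the equations above leave to the training implementation.

```python
# Sketch of L_diff for one minibatch of clean action sequences x_0 and their state trajectories.
import torch
import torch.nn.functional as F

def diffusion_loss(model, actions, states, alpha_bars):
    B = actions.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))        # t sampled uniformly
    eps = torch.randn_like(actions)                    # target noise eps ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1)
    x_t = ab.sqrt() * actions + (1 - ab).sqrt() * eps  # forward-noised sample
    eps_hat = model(x_t, t, states)
    return F.mse_loss(eps_hat, eps)                    # squared error against the true noise
```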

4. Inference and Multi-Seed Sampling Algorithm

The multi-seed sampling process generates an ensemble in which each member represents a distinct dynamic mode. The algorithm, as specified, operates as follows:

  1. For each of the $n$ seeds $\epsilon_i$:
    • Initialize $x_T^i \leftarrow \epsilon_i$.
    • For $t = T, \dots, 1$:
      • Predict $\epsilon_\theta(x_t^i, t, s_{1:T})$.
      • Compute the mean $\mu_t$ for the reverse diffusion step.
      • Sample $x_{t-1}^i \sim \mathcal{N}(\mu_t, \beta_t I)$.
      • For each $j < i$, compute the dynamics-level divergence. If it falls below the threshold $\tau$, inject a Gaussian perturbation of scale $\sigma_{\mathrm{div}} = \eta\,(\tau - \mathrm{div})/\tau$ into the current sample.
    • After $t = 1$, set $a_{1:T}^i \leftarrow x_0^i$.
  2. Return the ensemble $\{a_{1:T}^i\}_{i=1}^n$.
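
The loop below sketches this procedure. It assumes the `reverse_step` helper from Section 2 and a `dynamics_divergence` function implementing the metric of Section 5 (both passed in as callables); the seed count, horizon, $\tau$, and $\eta$ defaults are illustrative.

```python
# Sketch of multi-seed sampling with dynamics-level divergence checks.
import torch

def sample_ensemble(model, states, reverse_step, dynamics_divergence,
                    n_seeds=4, t_steps=100, action_shape=(16, 7), tau=0.05, eta=0.1):
    ensemble = []
    for i in range(n_seeds):
        x = torch.randn(1, *action_shape)              # x_T^i = eps_i ~ N(0, I)
        for t in reversed(range(t_steps)):             # t = T-1, ..., 0 (zero-indexed)
            x = reverse_step(model, x, t, states)      # predict eps, compute mu_t, sample
            for a_j in ensemble:                       # compare against sub-policies j < i
                div = dynamics_divergence(x, a_j)
                if div < tau:                          # too similar: inject a perturbation
                    sigma = eta * (tau - div) / tau
                    x = x + sigma * torch.randn_like(x)
        ensemble.append(x)                             # a^i_{1:T} <- x_0^i
    return ensemble                                    # {a^i_{1:T}} for i = 1..n
```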

A summary table organizes the major elements:

| Component | Value or Hyperparameter | Notes |
|---|---|---|
| Diffusion steps ($T$) | 50–100 | |
| Noise schedule ($\{\beta_t\}$) | Linear or cosine | As in DDPM |
| Number of seeds ($n$) | 4–8 | Balances behavior coverage and compute cost |
| Divergence threshold ($\tau$) | 10th percentile of empirical divergences | From the offline dataset |
| Perturbation scale ($\eta$) | Chosen so that $\sigma_{\mathrm{div}} \approx 0.1$ when $\mathrm{div} = 0$ | Adaptive to sample redundancy |
| KL regularization weight ($\alpha$) | 0.1–1.0 | |
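
For convenience, these settings can be gathered into a single configuration object; the concrete defaults below are illustrative picks from the quoted ranges, and the field names are assumptions rather than the paper's code.

```python
# Illustrative hyperparameter bundle mirroring the table above.
from dataclasses import dataclass

@dataclass
class MultiSeedDiffusionConfig:
    diffusion_steps: int = 100            # T, typically 50-100
    noise_schedule: str = "cosine"        # or "linear", as in DDPM
    n_seeds: int = 4                      # n, typically 4-8
    divergence_threshold: float = 0.05    # tau: 10th percentile of empirical divergences
    perturbation_scale: float = 0.1       # eta: gives sigma_div ~ 0.1 when div = 0
    kl_weight: float = 0.5                # alpha, typically 0.1-1.0
```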

5. Dynamics-Level Diversity Enforcement

Diversity among the policies in the ensemble is directly enforced at the dynamics level by a divergence metric that incorporates first- and second-order action differences. For any two action sequences $a^i, a^j$, the divergence

$$\mathrm{div}(a^i, a^j) = \frac{1}{T}\sum_{t=1}^T \left[\, \| \dot a^i_t - \dot a^j_t \|_2 + \big(1 - \cos(\ddot a^i_t, \ddot a^j_t)\big) \right]$$

with $\dot a_t = a_t - a_{t-1}$ and $\ddot a_t = \dot a_t - \dot a_{t-1}$, quantifies differences in both velocities and accelerations over a trajectory. If the divergence falls below a learned threshold $\tau$, an adaptive perturbation is injected to encourage further exploration of distinct dynamic regimes.

This suggests that the ensemble is not merely diverse in a statistical sense, but that the diversity is explicitly structured to be physically meaningful with respect to the agent's behavior.
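
A minimal sketch of this metric for a pair of action sequences follows; averaging the velocity and acceleration terms over their (slightly different) lengths and the epsilon guard inside the cosine are implementation assumptions.

```python
# Sketch of the dynamics-level divergence between action sequences of shape (T, d_a) or (B, T, d_a).
import torch
import torch.nn.functional as F

def dynamics_divergence(a_i, a_j, eps=1e-8):
    v_i, v_j = a_i.diff(dim=-2), a_j.diff(dim=-2)        # velocities:    a_t - a_{t-1}
    acc_i, acc_j = v_i.diff(dim=-2), v_j.diff(dim=-2)    # accelerations: v_t - v_{t-1}
    vel_term = (v_i - v_j).norm(dim=-1).mean()           # mean velocity difference
    cos = F.cosine_similarity(acc_i, acc_j, dim=-1, eps=eps)
    return (vel_term + (1.0 - cos).mean()).item()        # velocity term + acceleration term
```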

6. Implementation Considerations and Practical Use

Key implementation parameters are as follows:

  • Network Details: U-Net of depth 4, width 256; Transformer of 4 layers, 8 heads, hidden size 512.
  • Optimization: AdamW optimizer, learning rate $2 \times 10^{-4}$, batch size 128, typically 200 epochs.
  • Initialization: Each sub-policy is generated via a different Gaussian noise seed.
  • Downstream Integration: The resulting policies form a robust, expressive foundation for online fine-tuning in O2O-RL pipelines.
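
A training-loop sketch using these settings is shown below; `diffusion_loss` is the helper sketched in Section 3, and the dataset is assumed to yield `(actions, states)` sequence pairs.

```python
# Sketch of the optimization setup: AdamW, lr 2e-4, batch size 128, ~200 epochs.
import torch
from torch.utils.data import DataLoader

def train(denoiser, dataset, alpha_bars, epochs=200, batch_size=128, lr=2e-4):
    opt = torch.optim.AdamW(denoiser.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for actions, states in loader:                 # offline (action, state) sequence pairs
            loss = diffusion_loss(denoiser, actions, states, alpha_bars)
            opt.zero_grad()
            loss.backward()
            opt.step()
```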

A plausible implication is that this setup enables practitioners to generate policies that cover a wide set of behavioral modes from a single model instance, reducing model and computational complexity relative to training many independent diffusion networks.

7. Significance and Context within O2O-RL

The multi-seed dynamics-aware diffusion policy addresses two major O2O-RL bottlenecks: limited multimodal behavioral coverage and distributional shift during adaptation. By consolidating the modeling of multiple behaviors into one generative network and augmenting diversity at the dynamics level, it circumvents the need for separate model training for each mode. Empirical results show absolute improvements of +5.9% on locomotion tasks and +12.4% on dexterous manipulation in the D4RL benchmark compared to strong baselines, indicating enhanced generalization and scalability (Huang et al., 13 Nov 2025).

These methodological advances position the multi-seed dynamics-aware diffusion policy as a foundational technique for O2O-RL scenarios that demand both flexibility in policy deployment and robustness to distributional shifts in real-world robotic contexts.

