Multi-Seed Dynamics-Aware Diffusion Policy

Updated 20 November 2025
  • The paper introduces a novel multi-seed diffusion framework that models diverse behavioral modes in O2O-RL, yielding significant improvements in locomotion and manipulation tasks.
  • It utilizes a U-Net backbone with Transformer-based state conditioning to integrate state trajectories, enabling distinct sub-policies from independent Gaussian noise seeds.
  • Diversity is enforced via dynamics-level KL regularization, ensuring physically meaningful action variations and robust generalization across different real-world scenarios.

The multi-seed dynamics-aware diffusion policy is a generative policy architecture and training strategy designed to address the challenges of multimodal behavior representation and distributional robustness in offline-to-online reinforcement learning (O2O-RL). It enables a single diffusion network to efficiently model a diverse ensemble of sub-policies, each corresponding to a distinct behavioral mode, through the use of multiple diffusion noise seeds. This framework incorporates explicit dynamics-level diversity regularization to ensure that the resulting action sequences (policies) are not only diverse but also physically meaningful, supporting enhanced generalization and applicability in robotic learning regimes (Huang et al., 13 Nov 2025).

1. Policy Network Architecture

The core architecture consists of a U-Net backbone augmented with Transformer-based cross-attention for state conditioning. The following components are central:

  • U-Net Backbone: Processes a noisy action sequence $x_t \in \mathbb{R}^{T \times d_a}$ at each diffusion step $t$ and produces a noise estimate $\epsilon_\theta(x_t, t, s_{1:T})$.
  • State Conditioning via Transformer: The state sequence $s_{1:T}$ is embedded and used as keys and values in a Transformer. The query is derived from U-Net bottleneck features, allowing the policy to condition action sampling on the entire state trajectory.
  • Multi-Seed Ensemble: Rather than training multiple networks, multiple action sequence samples are generated during inference by initializing the reverse diffusion process with independent Gaussian noise seeds $\epsilon_i \sim \mathcal{N}(0, I)$. Each seed $\epsilon_i$ gives rise to a distinct sub-policy $\pi_\theta^i$, reflecting a different behavior mode present in the underlying dataset.
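
A minimal PyTorch sketch of this conditioning pattern is shown below. It is illustrative only: the single-level 1-D U-Net, layer counts, and widths are assumptions rather than the paper's exact architecture, but it reproduces the stated interface, with U-Net bottleneck features acting as cross-attention queries over the encoded state trajectory.

```python
# Illustrative sketch: state-conditioned noise predictor eps_theta(x_t, t, s_{1:T}).
import torch
import torch.nn as nn

class StateConditionedDenoiser(nn.Module):
    def __init__(self, action_dim, state_dim, hidden=256, n_heads=8, max_steps=1000):
        super().__init__()
        # Shallow 1-D "U-Net" over the action sequence (down/up projections only).
        self.down = nn.Conv1d(action_dim, hidden, kernel_size=3, padding=1)
        self.up = nn.Conv1d(hidden, action_dim, kernel_size=3, padding=1)
        # Diffusion-step embedding added to the bottleneck features.
        self.t_embed = nn.Embedding(max_steps, hidden)
        # Transformer encoder over the state trajectory; outputs serve as keys/values.
        self.state_proj = nn.Linear(state_dim, hidden)
        self.state_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True), num_layers=2)
        # Cross-attention: queries from the U-Net bottleneck, keys/values from states.
        self.cross_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)

    def forward(self, x_t, t, states):
        # x_t: (B, T, action_dim) noisy actions; t: (B,) step indices; states: (B, T, state_dim)
        h = self.down(x_t.transpose(1, 2)).transpose(1, 2)   # (B, T, hidden)
        h = h + self.t_embed(t).unsqueeze(1)                 # inject diffusion step
        kv = self.state_encoder(self.state_proj(states))     # encode s_{1:T}
        h, _ = self.cross_attn(query=h, key=kv, value=kv)    # condition on state trajectory
        return self.up(h.transpose(1, 2)).transpose(1, 2)    # predicted noise, (B, T, action_dim)

# Example: a batch of two 16-step action sequences with 7-dim actions and 17-dim states.
model = StateConditionedDenoiser(action_dim=7, state_dim=17)
eps_hat = model(torch.randn(2, 16, 7), torch.randint(0, 1000, (2,)), torch.randn(2, 16, 17))
```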

2. Diffusion Process: Forward and Reverse Dynamics

The policy exploits the Denoising Diffusion Probabilistic Model (DDPM) framework, adapting it to action sequence generation:

  • Forward Process: For timesteps $t = 1, \dots, T$, the noisy action sequence evolves according to

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

with $\alpha_t = 1 - \beta_t$ and cumulative noise coefficient $\bar\alpha_t = \prod_{i=1}^t (1 - \beta_i)$.

  • Reverse (Denoising) Process: The parameterized model reconstructs denoised samples via

$$p_\theta(x_{t-1} \mid x_t, s_{1:T}) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, s_{1:T}),\ \Sigma_t\big)$$

where

$$\mu_\theta(x_t, t, s_{1:T}) = \frac{1}{\sqrt{\alpha_t}}\left[ x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t, s_{1:T}) \right]$$

and in practice $\Sigma_t = \beta_t I$.

During inference, each seed initializes $x_T = \epsilon_i$, and the denoising process is executed independently for each sub-policy.
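
The equations above map directly onto a few lines of code. The following sketch assumes any noise predictor with the signature $\epsilon_\theta(x_t, t, s_{1:T})$ (for example the `StateConditionedDenoiser` sketched in Section 1); the linear $\beta$ schedule and its endpoints are illustrative assumptions.

```python
# Sketch of DDPM forward noising and a single reverse (denoising) step, with Sigma_t = beta_t I.
import torch

T_STEPS = 100
betas = torch.linspace(1e-4, 2e-2, T_STEPS)        # {beta_t}: assumed linear schedule
alphas = 1.0 - betas                               # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)          # cumulative \bar{alpha}_t

def forward_noise(x0, t, eps):
    """q(x_t | x_0): closed form of the forward process for a batch of timesteps t."""
    ab = alpha_bars[t].view(-1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

def reverse_step(model, x_t, t, states):
    """One step of p_theta(x_{t-1} | x_t, s_{1:T}); t is a zero-indexed Python int."""
    t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
    eps_hat = model(x_t, t_batch, states)
    mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    if t == 0:
        return mean                                # no noise added at the final step
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```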

3. Training Objectives and Diversity Regularization

Training jointly optimizes two primary objectives:

  • Standard DDPM Score-Matching Loss:

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[ \big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t,\ s_{1:T}\big)\big\|_2^2 \Big]$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $t$ is uniformly sampled, following the canonical DDPM formulation.

  • Sequence-Level KL Regularization (Ensemble Spread): To enforce global diversity among the sub-policies, a sequence-level KL regularization term is introduced post-training (fine-tuning or re-weighting). For each sub-policy $\pi^i_\theta$,

$$J(\pi^i_\theta) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log p^i_\theta(a \mid s)\big] + \alpha\, \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\log\frac{p^i_\theta(a \mid s)}{\max_{j<i}\, p^j_\theta(a \mid s)}\right]$$

The final training objective is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} - \sum_{i=1}^n J(\pi^i_\theta)$$

where the log-likelihood term promotes accurate modeling while the $\alpha$-weighted term encourages the spread of the ensemble. The hyperparameter $\alpha$ typically ranges from 0.1 to 1.0.
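
A sketch of the score-matching term for one batch is given below, reusing the schedule variables from the previous sketch; the diversity term $J(\pi^i_\theta)$ is omitted because it requires per-sub-policy likelihoods (or a surrogate for them), which the equations above leave to the training implementation.

```python
# Sketch of L_diff for one minibatch of clean action sequences x_0 and their state trajectories.
import torch
import torch.nn.functional as F

def diffusion_loss(model, actions, states, alpha_bars):
    B = actions.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))        # t sampled uniformly
    eps = torch.randn_like(actions)                    # target noise eps ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1)
    x_t = ab.sqrt() * actions + (1 - ab).sqrt() * eps  # forward-noised sample
    eps_hat = model(x_t, t, states)
    return F.mse_loss(eps_hat, eps)                    # squared error against the true noise
```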

4. Inference and Multi-Seed Sampling Algorithm

The multi-seed sampling process generates an ensemble in which each member represents a distinct dynamic mode. The algorithm, as specified, operates as follows:

  1. For each of the $n$ seeds $\epsilon_i$:
    • Initialize $x_T^i \leftarrow \epsilon_i$.
    • For $t = T, \dots, 1$:
      • Predict $\epsilon_\theta(x_t^i, t, s_{1:T})$.
      • Compute the mean $\mu_t$ for the reverse diffusion step.
      • Sample $x_{t-1}^i \sim \mathcal{N}(\mu_t, \beta_t I)$.
      • For each $j < i$, compute the dynamics-level divergence. If it falls below the threshold $\tau$, inject a Gaussian perturbation of scale $\sigma_{\mathrm{div}} = \eta\,(\tau - \mathrm{div})/\tau$ into the current sample.
    • After $t = 1$, set $a_{1:T}^i \leftarrow x_0^i$.
  2. Return the ensemble $\{a_{1:T}^i\}_{i=1}^n$.
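
The loop below sketches this procedure. It assumes the `reverse_step` helper from Section 2 and a `dynamics_divergence` function implementing the metric of Section 5 (both passed in as callables); the seed count, horizon, $\tau$, and $\eta$ defaults are illustrative.

```python
# Sketch of multi-seed sampling with dynamics-level divergence checks.
import torch

def sample_ensemble(model, states, reverse_step, dynamics_divergence,
                    n_seeds=4, t_steps=100, action_shape=(16, 7), tau=0.05, eta=0.1):
    ensemble = []
    for i in range(n_seeds):
        x = torch.randn(1, *action_shape)              # x_T^i = eps_i ~ N(0, I)
        for t in reversed(range(t_steps)):             # t = T-1, ..., 0 (zero-indexed)
            x = reverse_step(model, x, t, states)      # predict eps, compute mu_t, sample
            for a_j in ensemble:                       # compare against sub-policies j < i
                div = dynamics_divergence(x, a_j)
                if div < tau:                          # too similar: inject a perturbation
                    sigma = eta * (tau - div) / tau
                    x = x + sigma * torch.randn_like(x)
        ensemble.append(x)                             # a^i_{1:T} <- x_0^i
    return ensemble                                    # {a^i_{1:T}} for i = 1..n
```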

A summary table organizes the major elements:

| Component | Value or Hyperparameter | Notes |
|---|---|---|
| Diffusion steps ($T$) | 50–100 | |
| Noise schedule ($\{\beta_t\}$) | Linear or cosine | As in DDPM |
| Number of seeds ($n$) | 4–8 | Balances behavior coverage and compute cost |
| Divergence threshold ($\tau$) | 10th percentile of empirical divergences | From the offline dataset |
| Perturbation scale ($\eta$) | Chosen so that $\sigma_{\mathrm{div}} \approx 0.1$ when $\mathrm{div} = 0$ | Adaptive to sample redundancy |
| KL regularization weight ($\alpha$) | 0.1–1.0 | |
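
For convenience, these settings can be gathered into a single configuration object; the concrete defaults below are illustrative picks from the quoted ranges, and the field names are assumptions rather than the paper's code.

```python
# Illustrative hyperparameter bundle mirroring the table above.
from dataclasses import dataclass

@dataclass
class MultiSeedDiffusionConfig:
    diffusion_steps: int = 100            # T, typically 50-100
    noise_schedule: str = "cosine"        # or "linear", as in DDPM
    n_seeds: int = 4                      # n, typically 4-8
    divergence_threshold: float = 0.05    # tau: 10th percentile of empirical divergences
    perturbation_scale: float = 0.1       # eta: gives sigma_div ~ 0.1 when div = 0
    kl_weight: float = 0.5                # alpha, typically 0.1-1.0
```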

5. Dynamics-Level Diversity Enforcement

Diversity among the policies in the ensemble is directly enforced at the dynamics level by a divergence metric that incorporates first- and second-order action differences. For any two action sequences $a^i, a^j$, the divergence

$$\mathrm{div}(a^i, a^j) = \frac{1}{T}\sum_{t=1}^T \left[\, \| \dot a^i_t - \dot a^j_t \|_2 + \big(1 - \cos(\ddot a^i_t, \ddot a^j_t)\big) \right]$$

with $\dot a_t = a_t - a_{t-1}$ and $\ddot a_t = \dot a_t - \dot a_{t-1}$, quantifies differences in both velocities and accelerations over a trajectory. If the divergence falls below a learned threshold $\tau$, an adaptive perturbation is injected to encourage further exploration of distinct dynamic regimes.

This suggests that the ensemble is not merely diverse in a statistical sense, but that the diversity is explicitly structured to be physically meaningful with respect to the agent's behavior.
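
A minimal sketch of this metric for a pair of action sequences follows; averaging the velocity and acceleration terms over their (slightly different) lengths and the epsilon guard inside the cosine are implementation assumptions.

```python
# Sketch of the dynamics-level divergence between action sequences of shape (T, d_a) or (B, T, d_a).
import torch
import torch.nn.functional as F

def dynamics_divergence(a_i, a_j, eps=1e-8):
    v_i, v_j = a_i.diff(dim=-2), a_j.diff(dim=-2)        # velocities:    a_t - a_{t-1}
    acc_i, acc_j = v_i.diff(dim=-2), v_j.diff(dim=-2)    # accelerations: v_t - v_{t-1}
    vel_term = (v_i - v_j).norm(dim=-1).mean()           # mean velocity difference
    cos = F.cosine_similarity(acc_i, acc_j, dim=-1, eps=eps)
    return (vel_term + (1.0 - cos).mean()).item()        # velocity term + acceleration term
```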

6. Implementation Considerations and Practical Use

Key implementation parameters are as follows:

  • Network Details: U-Net of depth 4, width 256; Transformer of 4 layers, 8 heads, hidden size 512.
  • Optimization: AdamW optimizer, learning rate $2 \times 10^{-4}$, batch size 128, typically 200 epochs.
  • Initialization: Each sub-policy is generated via a different Gaussian noise seed.
  • Downstream Integration: The resulting policies form a robust, expressive foundation for online fine-tuning in O2O-RL pipelines.
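
A training-loop sketch using these settings is shown below; `diffusion_loss` is the helper sketched in Section 3, and the dataset is assumed to yield `(actions, states)` sequence pairs.

```python
# Sketch of the optimization setup: AdamW, lr 2e-4, batch size 128, ~200 epochs.
import torch
from torch.utils.data import DataLoader

def train(denoiser, dataset, alpha_bars, epochs=200, batch_size=128, lr=2e-4):
    opt = torch.optim.AdamW(denoiser.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for actions, states in loader:                 # offline (action, state) sequence pairs
            loss = diffusion_loss(denoiser, actions, states, alpha_bars)
            opt.zero_grad()
            loss.backward()
            opt.step()
```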

A plausible implication is that this setup enables practitioners to generate policies that cover a wide set of behavioral modes from a single model instance, reducing model and computational complexity relative to training many independent diffusion networks.

7. Significance and Context within O2O-RL

The multi-seed dynamics-aware diffusion policy addresses two major O2O-RL bottlenecks: limited multimodal behavioral coverage and distributional shift during adaptation. By consolidating the modeling of multiple behaviors into one generative network and augmenting diversity at the dynamics level, it circumvents the need for separate model training for each mode. Empirical results show absolute improvements of +5.9% on locomotion tasks and +12.4% on dexterous manipulation in the D4RL benchmark compared to strong baselines, indicating enhanced generalization and scalability (Huang et al., 13 Nov 2025).

These methodological advances position the multi-seed dynamics-aware diffusion policy as a foundational technique for O2O-RL scenarios that demand both flexibility in policy deployment and robustness to distributional shifts in real-world robotic contexts.

