DynaGuide: Steering Diffusion Policies with Active Dynamic Guidance
(2506.13922v1)
Published 16 Jun 2025 in cs.RO
Abstract: Deploying large, complex policies in the real world requires the ability to steer them to fit the needs of a situation. Most common steering approaches, like goal-conditioning, require training the robot policy with a distribution of test-time objectives in mind. To overcome this limitation, we present DynaGuide, a steering method for diffusion policies using guidance from an external dynamics model during the diffusion denoising process. DynaGuide separates the dynamics model from the base policy, which gives it multiple advantages, including the ability to steer towards multiple objectives, enhance underrepresented base policy behaviors, and maintain robustness on low-quality objectives. The separate guidance signal also allows DynaGuide to work with off-the-shelf pretrained diffusion policies. We demonstrate the performance and features of DynaGuide against other steering approaches in a series of simulated and real experiments, showing an average steering success of 70% on a set of articulated CALVIN tasks and outperforming goal-conditioning by 5.4x when steered with low-quality objectives. We also successfully steer an off-the-shelf real robot policy to express preference for particular objects and even create novel behavior. Videos and more can be found on the project website: https://dynaguide.github.io
The paper introduces DynaGuide, which actively steers diffusion policies using a latent visual dynamics model that guides actions towards desired outcomes and away from undesired ones.
It modifies the denoising process with a log-sum-exp guidance metric and stochastic sampling, balancing multiple complex visual conditions without re-training the base policy.
Experiments in simulation and on a real robot show up to 70% success rates and significant improvements over traditional goal-conditioned policies.
This paper introduces DynaGuide, a method for steering pre-trained diffusion policies for robotic manipulation. The core idea is to leverage an external, separately trained dynamics model to provide guidance during the diffusion policy's action denoising process. This approach offers flexibility and robustness compared to traditional goal-conditioned policies that require re-training or fail on out-of-distribution goals.
DynaGuide operates by modifying the inference-time denoising process of a diffusion policy. Given a current observation $o_t$ and a set of guidance conditions $G = g^+ \cup g^-$, representing desired ($g^+$) and undesired ($g^-$) visual outcomes, DynaGuide aims to generate an action sequence $a$ such that the predicted future state aligns well with $g^+$ and poorly with $g^-$.
The method relies on a latent visual dynamics model $h_\theta$ and an image embedder $\phi$. The dynamics model is a transformer that takes the current observation's latent representation $\phi(o_t)$ and a sequence of actions $a$ to predict a future latent state $\hat{z}_{t+H}$. The image embedder $\phi$ is frozen (using a pre-trained model such as DinoV2 (Oquab et al., 2023)), projecting observations into a latent space. The dynamics model $h_\theta$ is trained with a regression objective to predict the latent representation of a future observation $o_{t+H}$: $\mathcal{L}(o_t, o_{t+H}, a) = \|\phi(o_{t+H}) - h_\theta(\phi(o_t), a)\|_2^2$. A crucial implementation detail is training the dynamics model on noisy actions (sampled from the same diffusion noise schedule) alongside noiseless actions, so that it can provide meaningful gradients during the policy's noisy denoising process.
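A minimal PyTorch-style sketch of this training objective is given below; the frozen embedder `phi` (e.g., DinoV2), the transformer dynamics model `h_theta`, and the `noise_schedule` helper are assumed interfaces for illustration, not the authors' code:

```python
import torch
import torch.nn.functional as F

def dynamics_loss(h_theta, phi, o_t, o_tH, actions, noise_schedule, p_noisy=0.5):
    """Regression loss for the latent dynamics model h_theta.

    o_t, o_tH : current and future observations, shape (B, C, H, W)
    actions   : action chunk between them,        shape (B, T, action_dim)
    With probability p_noisy, the actions are corrupted with diffusion noise so
    h_theta also gives useful gradients on partially denoised actions at test time.
    """
    with torch.no_grad():                    # the embedder phi stays frozen
        z_t = phi(o_t)
        z_target = phi(o_tH)                 # regression target phi(o_{t+H})

    if torch.rand(()) < p_noisy:
        # noise_schedule is a hypothetical helper mirroring a DDPM scheduler's add_noise
        k = torch.randint(0, noise_schedule.num_steps, (actions.shape[0],))
        actions = noise_schedule.add_noise(actions, torch.randn_like(actions), k)

    z_pred = h_theta(z_t, actions)           # predicted latent H steps ahead
    return F.mse_loss(z_pred, z_target)      # ||phi(o_{t+H}) - h_theta(phi(o_t), a)||^2
```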
A guidance metric $d$ is defined using the dynamics model's prediction and the guidance conditions. This metric quantifies how well the predicted future state $\hat{z}_{t+H}$ matches the desired outcomes and avoids the undesired ones in the latent space. It is formulated as a difference of log-sum-exps of negative squared Euclidean distances (or, empirically, Euclidean distances) between $\hat{z}_{t+H}$ and the latent representations of the guidance conditions:

$$d(g^+, g^-, o_t, a) = \log \sum_{g \in g^+} \exp\!\left(-\frac{\|\hat{z}_{t+H} - \phi(g)\|_2^2}{\sigma}\right) - \log \sum_{g \in g^-} \exp\!\left(-\frac{\|\hat{z}_{t+H} - \phi(g)\|_2^2}{\sigma}\right)$$
Here, $\sigma$ is a hyperparameter modulating the sharpness of the soft maximum. This log-sum-exp structure allows DynaGuide to steer towards any one of multiple desired outcomes and away from any one of multiple undesired outcomes simultaneously.
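A minimal PyTorch sketch of this metric, assuming the predicted future latent $\hat{z}_{t+H}$ and the embedded guidance conditions $\phi(g)$ are already available as tensors (all names are illustrative, not the authors' code):

```python
import torch

def guidance_metric(z_pred, z_pos, z_neg, sigma):
    """Soft-maximum guidance metric d over multiple desired / undesired outcomes.

    z_pred : predicted future latent z_hat_{t+H}, shape (D,)
    z_pos  : embeddings of desired outcome images,   shape (N_pos, D)
    z_neg  : embeddings of undesired outcome images, shape (N_neg, D), may be empty
    sigma  : temperature controlling the sharpness of the soft maximum
    """
    def soft_max_term(z_conds):
        sq_dists = ((z_pred[None, :] - z_conds) ** 2).sum(dim=-1)   # (N,) squared distances
        return torch.logsumexp(-sq_dists / sigma, dim=0)            # soft max over conditions

    d = soft_max_term(z_pos)                      # attract toward any desired outcome
    if z_neg is not None and z_neg.numel() > 0:
        d = d - soft_max_term(z_neg)              # repel from any undesired outcome
    return d
```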
During the diffusion policy's denoising process, which iteratively refines a noisy action $a^k$ into a less noisy $a^{k-1}$ by estimating the noise $\epsilon(a^k, o_t)$, DynaGuide incorporates a guidance gradient. Inspired by classifier guidance (Dhariwal et al., 2021), the estimated noise is modified using the gradient of the guidance metric with respect to the current noisy action $a^k$:
$$\hat{\epsilon}(a^k, o_t) = \epsilon(a^k, o_t) - s\,\sqrt{1 - \bar{\alpha}_k}\,\nabla_{a^k}\, d(g^+, g^-, o_t, a^k)$$
where $s$ is the guidance strength and $\bar{\alpha}_k$ comes from the noise schedule. A higher $s$ means stronger guidance but can lead to instability. To mitigate this, DynaGuide employs stochastic sampling, performing multiple denoising steps and averaging them, a technique similar to ITPS (Wang et al., 2024).
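The sketch below combines these pieces into one guided DDPM-style denoising step, reusing `guidance_metric` from the sketch above. The averaging over `M` posterior draws is a simplified stand-in for the ITPS-style stochastic sampling, and `policy_eps`, `h_theta`, and the scalar schedule terms are assumed interfaces rather than the authors' implementation:

```python
import torch

def guided_ddpm_step(policy_eps, h_theta, z_obs, z_pos, z_neg,
                     a_k, k, alpha_k, alpha_bar_k, sigma, s, M=4):
    """One guided denoising step, averaged over M stochastic draws.

    policy_eps(a, k) : the base diffusion policy's noise prediction eps(a^k, o_t)
    h_theta(z, a)    : latent dynamics model predicting z_hat_{t+H}
    z_obs            : phi(o_t), frozen embedding of the current observation
    a_k              : current noisy action chunk, shape (T, action_dim)
    s                : guidance strength; larger values steer harder but less stably
    """
    # Gradient of the guidance metric d with respect to the noisy action a^k.
    a = a_k.detach().clone().requires_grad_(True)
    d = guidance_metric(h_theta(z_obs, a), z_pos, z_neg, sigma)   # from the sketch above
    grad_d = torch.autograd.grad(d, a)[0]

    # Classifier-guidance-style modification of the predicted noise (equation above).
    eps_hat = policy_eps(a_k, k) - s * (1.0 - alpha_bar_k) ** 0.5 * grad_d

    # Standard DDPM posterior mean computed with the guided noise estimate.
    mean = (a_k - (1.0 - alpha_k) / (1.0 - alpha_bar_k) ** 0.5 * eps_hat) / alpha_k ** 0.5

    # Simplified stochastic sampling (assumption): draw M candidates for a^{k-1}
    # around the guided mean and average them to stabilize strong guidance.
    beta_k = 1.0 - alpha_k
    draws = [mean + beta_k ** 0.5 * torch.randn_like(a_k) for _ in range(M)]
    return torch.stack(draws).mean(dim=0)
```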
Key Advantages & Practical Implications:
Flexible Steering Structure: Unlike fixed-input goal-conditioned policies, DynaGuide handles multiple positive and negative visual conditions dynamically at inference time. This enables complex steering objectives that cannot be expressed with a single goal input.
Increased Steering Robustness: By separating the dynamics model and leveraging powerful pre-trained visual features (DinoV2), DynaGuide is more robust to lower-quality or out-of-distribution guidance conditions than goal-conditioned policies, which often fail when the goal image doesn't exactly match the current environment's possibilities. The log-sum-exp metric helps average noisy signals from multiple conditions.
Plug-and-Play Modularity: DynaGuide only modifies the inference process, making it compatible with any pre-trained diffusion policy without requiring re-training the policy weights. This is demonstrated by successfully steering an off-the-shelf policy on a real robot.
Enhancing Underrepresented Behaviors: By actively guiding the denoising process, DynaGuide can steer the policy towards action modes that are rare or underrepresented in the base policy's training data, a task difficult for sample-based steering methods that only select from modes the policy already readily generates.
Experimental Validation:
The paper validates DynaGuide extensively in simulation using the CALVIN environment (Mees et al., 2021) and on a real robot.
CALVIN Experiments:
DynaGuide significantly increases target behavior success rates (up to 70%) compared to the base policy on tasks involving articulated parts and movable objects.
It outperforms sampling-based guidance (GPC) on precise tasks, and outperforms goal-conditioned policies on tasks with movable objects, where goal images match the scene less exactly.
On underspecified guidance conditions (where goal images don't show the robot or other irrelevant details), DynaGuide outperforms goal conditioning by 5.4x on average, demonstrating its robustness to noisy objectives.
DynaGuide successfully steers the policy to achieve multiple desired behaviors and avoid multiple undesired behaviors simultaneously.
It demonstrates the ability to enhance behaviors that were severely underrepresented (e.g., 1% of data) in the base policy's training set.
Real Robot Experiments: DynaGuide successfully steered an off-the-shelf diffusion policy (Chi et al., 2024) on a real ARX5 robot arm (Lin et al., 2024) to:
Express preference for a specific cup color when multiple cups were present (72.5% success).
Reach for a desired cup even when a closer cup obstructed it (80% success).
Create novel behavior (interacting with a computer mouse, which was absent from the base policy's training data but present in the dynamics model's training data), doubling the interaction frequency with the novel object.
Implementation Details & Compute:
The implementation uses a standard transformer architecture for the dynamics model and a U-Net for the diffusion policy, conditioned on ResNet-18 visual features and MLP-encoded proprioception. Training requires a single RTX 3090 GPU for 24-48 hours per model; inference on a 3090 takes 10-20 minutes per seed per task. The dynamics model has roughly 15M parameters and requires about 4 GB of VRAM for inference. Stochastic sampling (M = 4) improves stability without excessive computational cost.
Limitations:
A primary limitation is the difficulty in specifying the method of achieving an objective using only outcome observations as guidance conditions. Future work could explore incorporating multimodal guidance (e.g., language or kinesthetic demonstrations) for finer-grained control.
In summary, DynaGuide provides a practical and effective method for steering pre-trained diffusion robot policies using external dynamics guidance. Its modularity, robustness, and ability to enhance rare behaviors make it a promising approach for deploying large robot policies in dynamic real-world scenarios without requiring extensive retraining.