
Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps (2505.10482v2)

Published 15 May 2025 in cs.LG and cs.AI

Abstract: Diffusion policies, widely adopted in decision-making scenarios such as robotics, gaming and autonomous driving, are capable of learning diverse skills from demonstration data due to their high representation power. However, the sub-optimal and limited coverage of demonstration data could lead to diffusion policies that generate sub-optimal trajectories and even catastrophic failures. While reinforcement learning (RL)-based fine-tuning has emerged as a promising solution to address these limitations, existing approaches struggle to effectively adapt Proximal Policy Optimization (PPO) to diffusion models. This challenge stems from the computational intractability of action likelihood estimation during the denoising process, which leads to complicated optimization objectives. In our experiments starting from randomly initialized policies, we find that online tuning of Diffusion Policies demonstrates much lower sample efficiency compared to directly applying PPO on MLP policies (MLP+PPO). To address these challenges, we introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise-conditioned deterministic policy. By treating each denoising step as a differentiable transformation conditioned on pre-sampled noise, NCDPO enables tractable likelihood evaluation and gradient backpropagation through all diffusion timesteps. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to MLP+PPO when training from scratch, outperforming existing methods in both sample efficiency and final performance across diverse benchmarks, including continuous robot control and multi-agent game scenarios. Furthermore, our experimental results show that our method is robust to the number of denoising timesteps in the Diffusion Policy.

Summary

  • The paper reformulates diffusion timesteps as noise-conditioned deterministic transformations, enabling efficient gradient computation.
  • It demonstrates that full backpropagation through timesteps improves sample efficiency to levels comparable with traditional MLP+PPO methods.
  • Experimental results confirm enhanced robustness and performance in diverse RL tasks including robotic control and multi-agent games.

Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps

The paper "Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps" by Ningyuan Yang and collaborators presents a novel framework, NCDPO (Noise-Conditioned Diffusion Policy Optimization), designed to enhance the efficacy and efficiency of diffusion policies in decision-making scenarios. These scenarios encompass areas such as robotics, autonomous driving, and gaming, where the ability to learn diverse skills from demonstration data is imperative. However, diffusion policies often struggle with sub-optimal performance due to the limitations and coverage gaps inherent in the demonstration data.

A fundamental challenge when fine-tuning these policies using Reinforcement Learning (RL), particularly methods like Proximal Policy Optimization (PPO), is the computational complexity associated with evaluating action likelihood during the denoising process. This complexity hinders the adaptation of diffusion models, often resulting in lower sample efficiency compared to traditional MLP-based policies using PPO (MLP+PPO). The paper addresses this limitation by introducing NCDPO, which reconfigures diffusion policy into a noise-conditioned deterministic policy, thereby enabling tractable likelihood estimation and gradient backpropagation across all diffusion timesteps.
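To make this reformulation concrete, the sketch below illustrates the general idea in PyTorch: the denoising chain is evaluated as a deterministic function of noise vectors sampled ahead of time, and a Gaussian head on the final denoised output yields a tractable log-probability. This is a minimal sketch under simplifying assumptions; the class and method names, the network architecture, and the simplified update rule inside the loop are illustrative and are not the authors' implementation.

```python
import torch
import torch.nn as nn


class NoiseConditionedDiffusionPolicy(nn.Module):
    """Illustrative sketch: the denoising chain is run as a deterministic
    transformation of pre-sampled noise, so the resulting action distribution
    (a Gaussian head on the final denoised output) has a tractable log-prob."""

    def __init__(self, obs_dim: int, act_dim: int, n_steps: int = 5, log_std: float = -1.0):
        super().__init__()
        self.n_steps = n_steps
        # Epsilon-prediction network; this architecture is an assumption for the sketch.
        self.eps_net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )
        self.log_std = nn.Parameter(torch.full((act_dim,), log_std))

    def denoise(self, obs: torch.Tensor, noises: torch.Tensor) -> torch.Tensor:
        # noises: (batch, n_steps + 1, act_dim); noises[:, 0] plays the role of x_T,
        # and noises[:, t] replaces the noise that would otherwise be drawn at step t.
        x = noises[:, 0]
        for t in range(self.n_steps, 0, -1):
            t_emb = torch.full((obs.shape[0], 1), t / self.n_steps, device=obs.device)
            eps = self.eps_net(torch.cat([obs, x, t_emb], dim=-1))
            # Simplified stand-in for the DDPM posterior-mean update: the key point
            # is that no fresh sampling happens inside the loop, so the whole chain
            # is a differentiable function of (obs, noises).
            x = x - eps / self.n_steps + 0.1 * noises[:, t]
        return x

    def log_prob(self, obs: torch.Tensor, noises: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # Tractable log-likelihood of actions given the pre-sampled noises.
        mean = self.denoise(obs, noises)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist.log_prob(actions).sum(dim=-1)
```

In this view, the stochasticity of the diffusion policy lives entirely in the pre-sampled noise tensor, which is why the likelihood of the final action can be evaluated with a single closed-form distribution rather than marginalizing over the denoising chain.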

The core contribution of this work is threefold:

  1. Reformulation of Diffusion Process: The NCDPO approach views the denoising steps of a diffusion policy as deterministic transformations contingent on presampled noise. This allows for straightforward policy gradient computation, enabling efficient training through differentiation across the diffusion process.
  2. Improved Sample Efficiency: By enabling gradients to backpropagate through all diffusion timesteps, NCDPO achieves sample efficiency comparable to MLP+PPO, even when training from scratch (a minimal sketch of the corresponding surrogate objective follows this list). This positions NCDPO as more sample-efficient than existing methods, which often suffer from inefficiencies due to enlarged action spaces in RL training.
  3. Robustness and Performance: Experimental evaluations show that NCDPO not only achieves substantial improvements in sample efficiency but also demonstrates robustness across various settings, including continuous robot control and multi-agent games. The method consistently outperforms baselines in both performance and robustness to the number of diffusion timesteps.
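
As referenced in item 2 above, the following sketch shows how a noise-conditioned log-probability like the one above could be plugged into a standard clipped PPO surrogate; because the log-probability is differentiable through every denoising step, minimizing this loss backpropagates gradients through all diffusion timesteps. The function signature and variable names are hypothetical, and the loss is the generic PPO objective rather than code from the paper.

```python
import torch


def ppo_surrogate_loss(policy, obs, noises, actions, old_log_prob, advantages, clip_eps=0.2):
    """Generic clipped PPO surrogate using a noise-conditioned log-probability.
    Gradients of this loss flow through policy.log_prob and hence through every
    denoising step of the diffusion policy."""
    new_log_prob = policy.log_prob(obs, noises, actions)   # differentiable through all timesteps
    ratio = torch.exp(new_log_prob - old_log_prob)         # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```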

The implications of this research are significant for both theoretical understanding and practical applications of diffusion policies in RL environments. Theoretically, it indicates a pathway to overcoming challenges posed by the computational intractability of action likelihoods, paving the way for more efficient policy optimization frameworks. Practically, it provides a robust and versatile tool that enhances the adaptability and capability of diffusion policies, equipping them to better address real-world decision-making tasks with diverse environmental interactions.

Future research could explore extending these findings to real-world applications beyond simulation environments, examining the transferability of fine-tuned diffusion policies to physically embodied systems such as autonomous vehicles and robotic manipulation tasks. Additionally, investigating potential enhancements in policy representation and exploring the integration of adaptive diffusion models could further broaden the utility and effectiveness of this innovative approach.
