- The paper presents the Reward Difference Prediction (RDP) objective, reformulating RL finetuning as a supervised regression task that predicts reward differences between pairs of generated images.
- It employs proximal updates inspired by PPO to ensure stability and bounded optimization during training.
- Experiments demonstrate PRDP's superior performance on both small- and large-scale datasets, enhancing image generation quality on unseen prompts.
Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
The paper proposes Proximal Reward Difference Prediction (PRDP), a novel method that enables stable black-box reward finetuning of diffusion models on large-scale prompt datasets. It addresses the limitations of existing RL-based reward finetuning methods in the vision domain, particularly their instability in large-scale training, which compromises their ability to generalize to complex, unseen prompts.
Background
Diffusion models have demonstrated significant success in generative modeling of continuous data, including photorealistic text-to-image synthesis. However, their maximum likelihood training objective often misaligns with downstream requirements like generating novel object compositions and aesthetically preferred images. In the language domain, Reinforcement Learning from Human Feedback (RLHF) has been adopted to align LLMs with human preferences, showing notable success. Inspired by this, analogous reward models have been developed for the vision domain, such as HPSv2 and PickScore. Yet applying RL-based finetuning to diffusion models, as attempted by DDPO, reveals inherent instability in large-scale setups.
Proximal Reward Difference Prediction (PRDP)
PRDP addresses the instability of RL-based approaches by proposing a supervised regression objective for finetuning diffusion models, named Reward Difference Prediction (RDP). The core idea is a regression task in which the diffusion model predicts the reward difference between pairs of generated images, computed from their denoising trajectories. The paper theoretically establishes that achieving perfect reward difference prediction yields a diffusion model that maximizes the RL objective.
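A schematic form of this objective helps fix ideas. In the sketch below the notation is assumed rather than taken verbatim from the paper: β is the KL-regularization weight, p_old is the pre-update model whose denoising trajectories x^A_{0:T} and x^B_{0:T} are sampled for a prompt c, and r is the black-box reward evaluated on the final images x^A_0 and x^B_0.

```latex
% Sketch of the RDP regression loss (notation assumed, not verbatim from the paper)
\mathcal{L}_{\mathrm{RDP}}(\theta) =
\mathbb{E}_{c,\, x^{A}_{0:T},\, x^{B}_{0:T}} \left[ \left(
\beta \log \frac{p_\theta(x^{A}_{0:T} \mid c)}{p_{\mathrm{old}}(x^{A}_{0:T} \mid c)}
- \beta \log \frac{p_\theta(x^{B}_{0:T} \mid c)}{p_{\mathrm{old}}(x^{B}_{0:T} \mid c)}
- \bigl( r(x^{A}_{0}, c) - r(x^{B}_{0}, c) \bigr)
\right)^{2} \right]
```

Driving this squared error to zero for all sampled pairs recovers the optimal solution of the underlying KL-regularized RL objective, which is the equivalence the paper establishes theoretically.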
Key Contributions
- RDP Objective: The RDP objective is designed to inherit the optimal solution of the RL objective while providing enhanced training stability. This is formulated as a supervised learning task, predicting reward differences between image pairs generated from text prompts.
- Proximal Updates: To mitigate training instability, the authors propose proximal updates inspired by Proximal Policy Optimization (PPO), clipping the log probability ratios so that each optimization step remains stable and bounded (see the sketch after this list).
- Online Optimization Algorithm: To enhance training stability and performance, the authors employ an online optimization strategy where diffusion models are updated iteratively while sampling new data points, avoiding the pitfalls of static datasets.
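To make the last two bullets concrete, here is a minimal PyTorch-style sketch of a proximal, clipped reward-difference loss. It is an illustration under stated assumptions, not the paper's implementation: the per-step log-probability tensors, the KL weight `beta`, and the clip range `eps` are all assumed, and clipping is applied to the log-ratios as described in the Proximal Updates bullet.

```python
# Minimal sketch of a proximal, clipped reward-difference loss.
# Assumed inputs (not from the paper's code): per-step denoising
# log-probabilities for trajectories A and B under the current model
# (logp_new_*) and the pre-update model (logp_old_*), plus the scalar
# rewards of the two final images from the black-box reward model.
import torch

def prdp_style_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                    reward_a, reward_b, beta=0.1, eps=0.1):
    # Per-step log-probability ratios, clipped so that each optimization
    # step stays bounded (PPO-inspired proximal update).
    ratio_a = torch.clamp(logp_new_a - logp_old_a, -eps, eps)
    ratio_b = torch.clamp(logp_new_b - logp_old_b, -eps, eps)

    # Predicted reward difference via the log-ratio parameterization.
    pred_diff = beta * (ratio_a.sum() - ratio_b.sum())

    # Supervised regression onto the actual reward difference.
    return (pred_diff - (reward_a - reward_b)) ** 2

# Toy usage with random stand-ins for the per-step log-probabilities
# of a 50-step denoising trajectory.
T = 50
loss = prdp_style_loss(torch.randn(T), torch.randn(T),
                       torch.randn(T), torch.randn(T),
                       reward_a=torch.tensor(0.8),
                       reward_b=torch.tensor(0.3))
print(float(loss))
```

In the online scheme, the trajectories, per-step log-probabilities, and rewards feeding this loss would be periodically re-sampled from the current model rather than drawn once from a static dataset.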
Experimental Validation
PRDP is evaluated through a series of experiments:
- Small-Scale Finetuning: The method is tested on a dataset of 45 prompts with HPSv2 and PickScore as reward models. PRDP matches or slightly exceeds the performance of DDPO, demonstrating its efficacy in small-scale settings.
- Large-Scale Finetuning: PRDP scales to over 100K prompts from the Human Preference Dataset v2 (HPDv2), achieving superior generation quality on previously unseen prompts and remaining stable where DDPO fails.
- Multi-Reward Finetuning: Training on a mixture of rewards showcases PRDP's advantage in generating higher-quality images across complex and diverse prompt sets.
Implications and Future Work
The findings of this research have substantial implications for the practical application and theoretical understanding of diffusion models in generative tasks. PRDP's robustness and stability in large-scale finetuning contexts suggest broad applicability in domains requiring high-quality, diverse image generation. Additionally, the integration of supervised learning concepts to approach traditionally RL-driven objectives may inspire further novel strategies across different areas of AI model finetuning.
Future developments may explore further optimization techniques or hybrid approaches that blend the benefits of supervised learning stability with refined RL techniques. Researchers might also investigate extending PRDP to other model architectures and additional data modalities, thereby expanding its utility and impact.
Conclusion
PRDP represents a significant step forward in stable large-scale reward finetuning of diffusion models, offering a practical and theoretically sound alternative to RL-based methods. By converting the RLHF objective into a supervised learning task and incorporating proximal updates, PRDP provides a scalable, stable solution for enhancing diffusion models under broad and complex generative tasks.