- The paper presents a comprehensive analysis of DPO and GRPO in treating image generation as a chain-of-thought reasoning task.
- The study finds that DPO excels in in-domain performance while GRPO demonstrates superior generalization on out-of-domain tasks.
- Effective scaling and reward model selection are shown to be critical for balancing performance and mitigating overfitting.
This paper, "Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO" (2505.17017), explores the application of Reinforcement Learning (RL) algorithms, specifically Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), to autoregressive image generation models, viewing the generation process as a Chain-of-Thought (CoT) reasoning task. Unlike text-based CoT, image generation involves unique challenges such as ensuring text-image consistency, image quality, and sophisticated reward modeling. The paper aims to provide a comprehensive investigation into the performance and characteristics of GRPO and DPO in this domain, analyzing their in-domain (ID) and out-of-domain (OOD) performance, the influence of different reward models, and the effects of various scaling strategies.
The authors use Janus-Pro (2501.17811), a state-of-the-art autoregressive image generation model, as their baseline. Performance is evaluated on T2I-CompBench [huang2023t2i] for in-domain assessment (complex, detailed prompts) and GenEval [ghosh2023geneval] for out-of-domain generalization (short, templated prompts).
The investigation compares GRPO (an on-policy method based on PPO, but replacing the learned critic with group-wise normalization for advantage estimation) and DPO (an off-policy method that directly optimizes over preference pairs). To ensure a fair comparison, training data curation and computational costs are aligned. For GRPO, training uses T2I-CompBench prompts and generates images on the fly with a group size of 4. For DPO, preference pairs are constructed by generating images for the same T2I-CompBench prompts (matching the number of sampled images per prompt to GRPO's group size) and scoring them with a reward model; the highest- and lowest-scoring images form the chosen/rejected pair.
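The two mechanics being compared can be summarized in a short sketch. The following is a minimal illustration under assumed shapes and helper names (e.g., a `rewards` tensor of shape `(num_prompts, group_size)`, `images_per_prompt`, `scores_per_prompt`), not the paper's released implementation: the first function shows GRPO's critic-free, group-wise advantage normalization, and the second shows how chosen/rejected pairs for DPO are selected from reward-scored samples.

```python
import torch


def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-wise advantage estimation in the style of GRPO.

    `rewards` has shape (num_prompts, group_size): one scalar reward per
    sampled image in each prompt's group. Each sample's advantage is its
    reward normalized by the mean and std of its own group, so no learned
    value network (critic) is required.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def build_dpo_pairs(images_per_prompt, scores_per_prompt):
    """Construct (chosen, rejected) preference pairs for DPO.

    For each prompt, the image with the highest reward-model score becomes
    the `chosen` sample and the lowest-scoring one the `rejected` sample,
    with the number of sampled images per prompt matching GRPO's group size
    for a fair comparison.
    """
    pairs = []
    for images, scores in zip(images_per_prompt, scores_per_prompt):
        best = max(range(len(scores)), key=lambda i: scores[i])
        worst = min(range(len(scores)), key=lambda i: scores[i])
        pairs.append({"chosen": images[best], "rejected": images[worst]})
    return pairs
```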
Key Findings:
- In-Domain vs. Out-of-Domain Performance:
  - DPO demonstrates superior performance on the in-domain T2I-CompBench dataset, consistently outperforming GRPO.
  - GRPO exhibits stronger generalization capabilities on the out-of-domain GenEval dataset compared to DPO.
- Impact of Different Reward Models:
  - The paper examines three types of reward models: Human Preference Models (HPS [wu2023hps], ImageReward [xu2023imagereward]), Visual Question Answering Models (UnifiedReward [wang2025unified], Fine-tuned ORM [guo2025can]), and Metric Reward.
  - DPO's generalization performance is more sensitive to the choice of reward model than GRPO's, showing larger fluctuations across reward models.
  - Reward models with better intrinsic generalization (evaluated by a best-of-N strategy on GenEval; see the sketch after this list) tend to improve the generalization of both GRPO and DPO, suggesting that the reward model's generalization capacity is crucial for RL generalization.
- Investigation of Effective Scaling Strategies:
  - Three scaling strategies are explored: scaling sampled images per prompt, scaling in-domain training data diversity/quantity (using a GPT-4o-based pipeline to expand T2I-CompBench), and iterative training (GRPO-Iter, DPO-Iter).
  - For GRPO: Scaling sampled images per prompt (increasing group size) is found to be computationally efficient for boosting in-domain performance. Moderate scaling of both sample size and in-domain data improves generalization, but excessive scaling can lead to overfitting and diminish generalization gains.
  - For DPO: Iterative training (DPO-Iter) significantly enhances in-domain performance but can degrade generalization after several iterations due to overfitting to the training preference data. Moderate sample sizes (for preference pair selection) optimize preference contrast and improve both in-domain and out-of-domain performance, while excessive sampling can introduce bias. Scaling the in-domain data volume for DPO helps mitigate preference bias from limited data and improves both in-domain and generalization performance.
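The best-of-N probe used to gauge a reward model's intrinsic generalization can be approximated as in the sketch below. Here `generate`, `reward_model`, and `geneval_score` are hypothetical callables standing in for the frozen base model, the reward model under test, and the benchmark's own evaluator; they are not functions from the released codebase.

```python
def best_of_n_generalization(prompts, generate, reward_model, geneval_score, n=8):
    """Rank a reward model's generalization via a best-of-N strategy.

    For each out-of-domain prompt (e.g., from GenEval), sample N images from
    the frozen base model, let the reward model pick its favorite, and score
    that pick with the benchmark's evaluator. Reward models whose picks score
    higher are better guides for RL fine-tuning toward OOD generalization.
    """
    picked = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda img: reward_model(prompt, img))
        picked.append((prompt, best))
    return sum(geneval_score(p, img) for p, img in picked) / len(picked)
```

In this sense, a reward model that reliably separates good from bad OOD samples under best-of-N is the kind of model the paper links to better RL generalization.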
Practical Implications for Implementation:
- For tasks requiring high performance primarily on a well-defined, complex in-domain distribution (like T2I-CompBench), DPO might be the preferred algorithm due to its strong ID performance.
- For tasks where generalization to new, potentially simpler or templated prompts is critical, GRPO shows better promise.
- The choice and quality of the reward model are paramount. Using reward models with demonstrated generalization capabilities is crucial for achieving good OOD performance with RL fine-tuning, especially for DPO, which is more sensitive to the choice of reward model.
- Scaling strategies should be chosen carefully based on the target objective (ID performance vs. OOD generalization) and the algorithm used:
  - If using GRPO and aiming for efficient ID performance gains, prioritize scaling the group size (sampled images per prompt). To improve generalization, moderate scaling of group size and data is advisable.
  - If using DPO and aiming for high ID performance, iterative training is effective, but monitor OOD performance to avoid degradation. To balance ID and OOD performance, moderate sampling sizes for preference pair creation and scaling the training data are beneficial.
The paper provides valuable empirical insights into applying RL methods to complex generative tasks such as autoregressive image generation, highlighting the distinct behaviors of on-policy (GRPO) and off-policy (DPO) algorithms and offering guidance on selecting reward models and scaling strategies for specific performance goals. The code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT, providing a practical starting point for practitioners.