Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (2501.13926v1)

Published 23 Jan 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT

This paper explores the application of Chain-of-Thought (CoT) reasoning strategies, traditionally used in LLMs and large multimodal models (LMMs) for complex understanding tasks, to enhance autoregressive image generation. The core idea is that autoregressive image generation, which produces images token by token, shares a similar step-by-step decoding process with LLMs and LMMs that can potentially benefit from verification and reinforcement techniques. The authors adopt Show-o (Xie et al., 22 Aug 2024 ), a recent autoregressive image generation model, as their baseline and evaluate performance on the GenEval benchmark [2024.ghosh_geneval], which specifically tests for object attributes and co-occurrence in text-to-image generation.

The investigation focuses on three main strategies: test-time computation for verification, preference alignment using Direct Preference Optimization (DPO), and combining these techniques.

Test-time Verification

The paper investigates using reward models as test-time verifiers, similar to how they are used to select better reasoning paths in LLMs. Two types of reward models are explored: Outcome Reward Model (ORM), which evaluates the final generated output, and Process Reward Model (PRM), which evaluates intermediate generation steps.

  • ORM Implementation: A zero-shot ORM is implemented using LLaVA-OneVision (7B) (Li et al., 6 Aug 2024), a capable LMM. It is prompted with the text prompt and the final generated image to assess accuracy. A large dataset of 288K text-to-image ranking examples is curated by generating multiple images per prompt with Show-o and labeling them as 'yes' (good quality) or 'no' (low quality) based on a GenEval metric. This dataset is used to fine-tune LLaVA-OneVision, creating a fine-tuned ORM for more accurate final-image evaluation. A best-of-N selection strategy is applied, where the ORM scores N candidate images and the highest-scoring one is selected; a minimal sketch of this selection loop follows the list below.
  • PRM Implementation: Similar to the ORM, LLaVA-OneVision (7B) is used as a zero-shot PRM, evaluating intermediate image states. A step-wise best-of-N strategy is used, where the PRM guides the selection at each decoding step. To improve the PRM, a 300K step-wise text-to-image ranking dataset is curated: Show-o generates intermediate images, and an automated annotation method inspired by Math-Shepherd (Zhang et al., 14 Mar 2024) evaluates the potential of an intermediate step to lead to a good final image. LLaVA-OneVision is fine-tuned on this dataset to create a fine-tuned PRM.
  • Experimental Insights: Experiments on GenEval with best-of-20 selection show that test-time verification significantly improves the baseline (53% to 63% with the fine-tuned ORM). The ORM, which evaluates the final image, performs markedly better than the PRM (63% vs 55% for the fine-tuned variants). This is attributed to the nature of autoregressive image generation: early-stage images are often too blurry for the PRM to evaluate effectively, while later-stage images from different paths become too similar to discriminate. Fine-tuning enhances both the ORM and the PRM, which then scale better with increasing N.
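
The selection procedure itself is straightforward. Below is a minimal sketch of best-of-N verification; generate_image (one Show-o sampling pass) and orm_score (the fine-tuned LLaVA-OneVision verifier returning the probability of a 'yes' judgment) are hypothetical stand-ins for the paper's components, not the released API.

```python
# Minimal sketch of best-of-N test-time verification with an outcome reward model.
# `generate_image` and `orm_score` are hypothetical placeholders, not the released API.

def best_of_n(prompt: str, n: int, generate_image, orm_score):
    """Sample n candidate images and keep the one the ORM scores highest."""
    candidates = [generate_image(prompt) for _ in range(n)]
    scores = [orm_score(prompt, img) for img in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]
```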

Preference Alignment

Direct Preference Optimization (DPO) [2024.rafailov_direct] is applied to align the autoregressive generation model (Show-o) with desired image quality.

  • DPO Ranking Data Curation: The yes/no labeled data collected for ORM fine-tuning is used to create a paired ranking dataset (preferred/dispreferred images for the same prompt). Approximately 10K such pairs are curated.
  • DPO Implementation: DPO's maximum-likelihood objective is applied directly to fine-tune Show-o, treating it as a policy parameterized by the network weights, with the original frozen Show-o serving as the reference policy. The objective encourages higher likelihood for preferred images; a schematic of this loss appears after the list below.
  • Iterative DPO: Inspired by iterative DPO (Pang et al., 30 Apr 2024 ), the DPO-aligned model is used to generate new images, which are re-annotated, and a refined DPO dataset (7K samples) is created. Another round of DPO training is performed on this refined data.
  • Experimental Insights: Initial DPO improves the baseline by 9% (to 62%), comparable to the fine-tuned ORM. Iterative DPO further improves performance by 2% (to 65%), surpassing all test-time verifiers tested individually. This highlights DPO's effectiveness in reinforcing generation capabilities and the benefit of iterative refinement.
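
As a concrete reference, the following is a schematic of the standard DPO objective applied to summed token log-likelihoods of image tokens. It is a sketch under the assumption that per-image log-probabilities are available from both the trainable policy and the frozen reference copy of Show-o; it is not the authors' training code.

```python
# Schematic DPO loss for an autoregressive image generator (a sketch, not the paper's code).
# Inputs are summed token log-likelihoods of the preferred/dispreferred image tokens
# under the trainable policy and the frozen reference model.

import torch.nn.functional as F

def dpo_loss(logp_pref, logp_dispref, ref_logp_pref, ref_logp_dispref, beta=0.1):
    """Standard DPO objective: push the policy's preference margin above the reference's."""
    policy_margin = logp_pref - logp_dispref
    reference_margin = ref_logp_pref - ref_logp_dispref
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```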

DPO Alignment plus Test-time Verifiers

The paper investigates combining DPO alignment and test-time verification to leverage their complementary strengths. Three integration approaches are explored:

  1. DPO with Reward Model Guidance: Integrating the fine-tuned ORM's objectives into the DPO training process, especially using prompt-only datasets, to improve generalization.
  2. Verification after DPO Alignment: Applying the fine-tuned ORM for best-of-N selection on the model that has undergone iterative DPO alignment.
  3. Verification after DPO with Reward Model Guidance: Combining approach 1 (DPO training with ORM guidance) and approach 2 (ORM verification at test time).
  • Experimental Insights: All integration methods yield greater improvements than the individual techniques, confirming their complementary nature (see Table 2). The third approach, using the fine-tuned ORM for guidance during iterative DPO training and then for best-of-N verification at test time, achieves the highest overall score (75%); a high-level sketch of this recipe is shown below.
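
To make the combined recipe concrete, here is a high-level sketch of the third approach under stated assumptions: iterative DPO rounds whose preference pairs are annotated with the ORM, followed by best-of-N verification with the same ORM at inference. All callables (sample, orm_score, annotate_pairs, dpo_update) are hypothetical placeholders, not functions from the released code.

```python
# High-level sketch of ORM-guided iterative DPO followed by best-of-N verification.
# Every callable argument is a hypothetical placeholder for the paper's components.

def train_and_infer(prompts, rounds, n, sample, orm_score, annotate_pairs, dpo_update):
    policy = "show-o-base"                          # handle for the current checkpoint
    for _ in range(rounds):                         # iterative DPO rounds
        images = {p: [sample(policy, p) for _ in range(4)] for p in prompts}
        pairs = annotate_pairs(images, orm_score)   # preferred/dispreferred pairs scored by the ORM
        policy = dpo_update(policy, pairs)          # one round of DPO training with ORM guidance
    def generate(prompt):                           # test time: best-of-N with the same ORM
        candidates = [sample(policy, prompt) for _ in range(n)]
        return max(candidates, key=lambda img: orm_score(prompt, img))
    return generate
```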

Potential Assessment Reward Model (PARM)

Based on the limitations observed with the ORM (no step-wise information) and the PRM (struggles with blurry early steps and near-identical later steps), the authors propose PARM, a reward model tailored for autoregressive image generation. PARM adaptively assesses generation steps and performs best-of-N' selection over high-potential paths (a sketch of the selection loop follows the list below). Its methodology involves three progressive tasks:

  1. Clarity Judgment: At each intermediate step, PARM determines if the partially generated image is clear enough for meaningful assessment (binary yes/no). This skips evaluation on blurry early-stage images.
  2. Potential Assessment: For steps deemed clear, PARM assesses if the current state has high potential to lead to a high-quality final image (binary yes/no). Paths with low potential are truncated.
  3. Best-of-N' Selection: Among the N' paths judged to have high potential that reach the final step, PARM selects the best one, similar to the ORM. If N' = 0, a fallback is used.
  • PARM Ranking Data Curation: A new 400K dataset is curated by re-annotating the ORM data prompts, divided into subsets for the three tasks (120K for clarity, 80K for potential, 200K for final selection). Automated labeling is used for clarity and potential assessment based on final image quality.
  • Experimental Insights: PARM demonstrates superior performance as a test-time verifier compared to ORM and PRM (67% vs 63% vs 55%). When integrated with iterative DPO (using PARM for guidance and verification), Show-o achieves an overall score of 77% on GenEval (see Table 3). This is a significant improvement of +24% over the baseline Show-o (53%) and surpasses Stable Diffusion 3 (Esser et al., 5 Mar 2024 ) by +15% (77% vs 62%). The gains are particularly noticeable in challenging compositional aspects like object counts, colors, position, and attribute binding.
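
The following is a minimal sketch of PARM's adaptive, step-wise selection over candidate paths. The three judgments (is_clear, has_potential, final_score) stand in for the fine-tuned PARM's clarity, potential, and final-selection tasks, and step_decode advances one path by one autoregressive stage; all names are illustrative assumptions rather than the paper's interface.

```python
# Sketch of PARM's adaptive selection over N generation paths.
# All callable arguments are illustrative placeholders for PARM and the generator.

def parm_select(prompt, n_paths, n_steps, init_path, step_decode,
                is_clear, has_potential, final_score):
    """Skip blurry steps, prune low-potential paths, then pick the best surviving final image."""
    paths = [init_path(prompt) for _ in range(n_paths)]
    for _ in range(n_steps):
        paths = [step_decode(p) for p in paths]
        survivors = []
        for p in paths:
            if not is_clear(prompt, p):       # too blurry to judge: keep and defer assessment
                survivors.append(p)
            elif has_potential(prompt, p):    # clear and promising: keep
                survivors.append(p)
            # clear but low-potential paths are truncated here
        if survivors:                         # fallback: never prune every path
            paths = survivors
    return max(paths, key=lambda p: final_score(prompt, p))   # best-of-N' over survivors
```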

Potential Assessment Reward Model ++ (PARM++)

PARM++ enhances PARM by incorporating a reflection mechanism that enables self-correction of generated images; the correction loop is sketched after the list below.

  • Reflection Mechanism: After the best image is selected by PARM, PARM++ performs a reflection evaluation, checking the alignment between the final image and the prompt. If misalignment is detected, it provides detailed textual reasons for the discrepancies.
  • Self-correction Fine-tuning: The generative model (Show-o) is fine-tuned to handle multi-modal input (original prompt, suboptimal image, reflection text) to iteratively refine the image based on the feedback until the reflection evaluation passes (max 3 iterations). A 10K dataset is curated from PARM++ data, consisting of text prompts, low-quality images, corresponding high-quality images, and generated reflection reasons.
  • Experimental Insights: Reflection in PARM++ significantly enhances image quality at test time (overall score improves from 61% to 70% when enabled). The self-correction fine-tuning slightly impacts the baseline performance (-2%), which the authors suggest might be addressed by more robust general large models in the future. Qualitative examples show that self-correction improves issues like incorrect object attributes, counts, and layouts (see Figure 8).
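
A minimal sketch of the reflection-driven correction loop (capped at three rounds, as in the paper) is shown below. Here reflect stands in for PARM++'s reflection evaluation, returning an alignment verdict plus textual reasons, and refine for the self-correction-tuned Show-o that conditions on the prompt, the current image, and the reflection text; both names are illustrative assumptions.

```python
# Sketch of the PARM++ reflection loop. `reflect` and `refine` are illustrative placeholders.

def self_correct(prompt, image, reflect, refine, max_rounds=3):
    """Iteratively refine an image until the reflection check passes or the budget is spent."""
    for _ in range(max_rounds):
        aligned, reasons = reflect(prompt, image)   # check image-prompt alignment, get textual feedback
        if aligned:                                 # reflection passes: stop correcting
            break
        image = refine(prompt, image, reasons)      # regenerate conditioned on the feedback
    return image
```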

Conclusion and Practical Implications

The paper successfully demonstrates that CoT reasoning strategies can be effectively adapted and applied to autoregressive image generation to improve performance. Test-time verification (especially with the proposed PARM) and preference alignment (via DPO) are shown to be effective individually and achieve even greater gains when combined. The introduced PARM and PARM++ reward models are specifically designed to address the unique challenges of step-wise assessment in image generation, enabling adaptive evaluation and iterative self-correction.

For practical implementation, this research suggests augmenting an autoregressive image generation model like Show-o with:

  • A specialized reward model: PARM, trained on curated step-wise clarity and potential assessment data, along with final outcome data.
  • Preference alignment post-training: Using DPO (preferably iterative DPO) trained on a ranking dataset of preferred/dispreferred image outputs.
  • Integrated inference: Employing PARM for best-of-N selection during inference after DPO alignment, potentially further guided by PARM during training.
  • Self-correction loop (optional but beneficial): Incorporating PARM++'s reflection mechanism and fine-tuning the generative model to act on reflection feedback for iterative refinement.

Implementation requires significant data curation for training the reward models and DPO alignment, as well as computational resources for best-of-N sampling at inference time (running the generation model N times) and for running the reward model for scoring. The iterative self-correction also adds inference-time overhead. The trade-off lies between inference cost (higher for best-of-N and reflection) and the resulting image-quality improvements, particularly in complex compositional scenarios. The code and models are released, which is crucial for practitioners looking to apply these techniques.

Authors (7)
  1. Ziyu Guo (49 papers)
  2. Renrui Zhang (100 papers)
  3. Chengzhuo Tong (4 papers)
  4. Zhizheng Zhao (1 paper)
  5. Peng Gao (401 papers)
  6. Hongsheng Li (340 papers)
  7. Pheng-Ann Heng (196 papers)