
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL (2505.24875v2)

Published 30 May 2025 in cs.CV and cs.CL

Abstract: Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization. To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision LLM to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.

Summary

ReasonGen-R1: Advancing Autoregressive Image Generation through CoT and Reinforcement Learning

The paper introduces ReasonGen-R1, a novel approach that integrates Chain-of-Thought (CoT) reasoning with reinforcement learning (RL) in autoregressive image generation models. Unlike conventional text-to-image models that translate textual prompts directly into images, ReasonGen-R1 interleaves text-based reasoning with image generation, potentially improving both the interpretability and the fidelity of the produced visual content. Its two-stage combination of Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) forms a training pipeline aimed at developing "think-and-generate" capabilities in image synthesis models.

The framework begins with SFT, where the model acquires textual reasoning abilities. This is done by training on a carefully curated dataset compiled from the LAION aesthetics subset and annotated with CoT reasoning sequences. Each reasoning sequence is paired with its prompt and image, allowing the model to learn reasoning and image generation jointly. This marks a departure from earlier autoregressive models that often required discrete stages to alternate between modalities.
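To make the interleaved SFT setup concrete, the sketch below shows one plausible target-sequence layout: the prompt tokens, followed by the written rationale, followed by the image tokens, with the rationale delimited by special markers. The marker names (`<bor>`/`<eor>`) and the exact layout are assumptions for illustration, not the paper's confirmed format; training would then apply standard next-token prediction over the rationale and image tokens.

```python
def build_sft_sequence(prompt_ids, rationale_ids, image_token_ids,
                       bor_id, eor_id):
    """Assemble one SFT training sequence (illustrative layout, not the
    paper's confirmed format): prompt -> rationale (delimited by
    hypothetical begin/end-of-rationale markers) -> image tokens.
    The model is trained with next-token prediction on this sequence,
    so it learns to emit a text rationale before any image tokens."""
    return prompt_ids + [bor_id] + rationale_ids + [eor_id] + image_token_ids

# Example with toy token IDs: prompt [1, 2], rationale [3], image [4, 5],
# and hypothetical marker IDs 100 (<bor>) and 101 (<eor>).
seq = build_sft_sequence([1, 2], [3], [4, 5], 100, 101)
# → [1, 2, 100, 3, 101, 4, 5]
```

The key point the layout encodes is ordering: because the rationale tokens precede the image tokens in the target, the autoregressive model must "think" in text before it begins to draw.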

Following SFT, the model undergoes RL via GRPO, which uses a pretrained vision-LLM as a reward function to score prompt-image congruity. Notably, an adaptive entropy loss is introduced to stabilize RL training, addressing the mode collapse that affected prior models. These choices illustrate how RL can be adapted to multimodal settings where explicit reasoning improves adherence to the prompt.
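The group-relative step that gives GRPO its name can be sketched briefly: for each prompt, several images are sampled, each is scored by the vision-LLM reward model, and each sample's advantage is its reward normalized against the group's mean and standard deviation. The sketch below shows only this normalization under that standard GRPO formulation; the reward model itself, the policy update, and the paper's adaptive entropy loss are not shown.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages (a minimal sketch of the standard GRPO
    normalization): each sampled image's reward, as scored by a pretrained
    vision-LLM (not modeled here), is centered and scaled by its prompt
    group's statistics. Samples better than the group average get positive
    advantages; worse-than-average samples get negative ones."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four images sampled for one prompt, scored by the reward model.
adv = grpo_advantages([0.9, 0.4, 0.6, 0.5])
# Advantages sum to zero; the highest-reward sample (0.9) gets the
# largest positive advantage.
```

Normalizing within the group, rather than against a learned value function, is what lets GRPO dispense with a separate critic network.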

Empirically, ReasonGen-R1 outperforms across several benchmarks—GenEval, DPG-Bench, and T2I-Benchmark—showing substantial gains in image generation when combined with textual reasoning. Quantitatively, it improves on existing baselines, including the popular Janus-Pro-7B model, by 6% on GenEval, 1.69% on DPG-Bench, and 13.38% on T2I-Benchmark.

These findings have significant implications for future developments in generative AI. The integration of CoT reasoning with autoregressive frameworks could transform the conventional paradigms of image synthesis, fostering models that can better plan and envision the context before generating content. This progression may not only refine output quality but also bolster the interpretability and reliability of AI-generated content, addressing lingering challenges in visual fidelity and contextual accuracy.

However, while ReasonGen-R1 demonstrates marked improvements, the research acknowledges several limitations, including potential biases transferred from pretrained models and the need to explore generalization to broader, real-world tasks. Addressing these concerns will be crucial for extending ReasonGen-R1's utility beyond controlled environments to more varied and complex domains.

Future research directions are manifold. Exploring larger and more diverse datasets, deploying different RL architectures, and investigating cross-validation with human-in-the-loop feedback are all promising avenues. Additionally, this work sets the groundwork for developing more nuanced CoT algorithms that further refine the reasoning paths within multimodal contexts.

In conclusion, ReasonGen-R1 represents a significant advancement in the domain of autoregressive image generation, offering novel insights into the integration of reasoning and visual synthesis. It provides a robust foundation for future endeavors in AI-driven generative models, bridging gaps between human-like reasoning processes and machine-generated visual content.
