ThinkGen: Modular Visual Generation
- ThinkGen is a generalized, think-driven visual generation framework that leverages chain-of-thought reasoning to decouple language-based instruction and pixel-level synthesis.
- Its modular design uses a pretrained multimodal LLM for explicit CoT reasoning and a Diffusion Transformer for tailored, high-fidelity image generation.
- The SepGRPO training paradigm optimizes each module separately, yielding robust performance across diverse benchmarks in text-to-image and editing tasks.
ThinkGen is a generalized, think-driven visual generation framework that explicitly leverages Chain-of-Thought (CoT) reasoning within Multimodal LLMs (MLLMs) for a wide spectrum of generation scenarios. Orchestrating explicit, decoupled reasoning and synthesis, ThinkGen employs a pretrained MLLM to generate tailored, context-sensitive instructions from user intent and a Diffusion Transformer (DiT) to produce high-fidelity images conditioned on these instructions. A separable reinforcement-learning paradigm (“SepGRPO”) enables efficient, scenario-specific optimization across diverse datasets, resulting in robust, state-of-the-art performance across multiple benchmarks and task types (Jiao et al., 29 Dec 2025).
1. Decoupled Architectural Paradigm
ThinkGen introduces a modular, decoupled architecture separating language-based reasoning from pixel-level synthesis. The system comprises two principal components:
- Pretrained MLLM (Qwen3-VL-8B-Think): Receives user prompts (descriptive captions or editing instructions, optionally with reference images) and performs explicit autoregressive CoT reasoning. The output comprises a detailed CoT trace terminated by a special token (</think>) and, crucially, a single distilled instruction summarizing the intended generation.
- Diffusion Transformer (DiT, OmniGen2-DiT-4B): Consumes only the post-</think> hidden states (the distilled instruction), augmented by K=25 “Prepadding States” for conditioning stability across variable instruction lengths. In editing tasks, visual latents (from a frozen VAE) are concatenated with text states using a joint-attention transformer. The DiT module then synthesizes high-quality images with minimal textual noise.
This architectural separation enables each module to specialize: the MLLM in structured reasoning and instruction rewriting, and the DiT in controlled diffusion-based image synthesis. The boundary between reasoning and action is operationalized by the “VGI-refine” step, which strips away reasoning artifacts, passing only the distilled instruction to the image generator (Jiao et al., 29 Dec 2025).
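The hand-off between the two modules can be illustrated with a minimal sketch. It assumes the MLLM output is available as a token list with one hidden-state vector per token, and that the K=25 Prepadding States are simply prepended to the post-</think> states; the function name `vgi_refine` and the list-based data layout are hypothetical and chosen only for illustration, not taken from the ThinkGen code.

```python
THINK_END = "</think>"   # marker that closes the CoT trace
NUM_PREPAD = 25          # K=25 Prepadding States, as stated in the text


def vgi_refine(tokens, hidden_states, prepad_states):
    """Keep only the hidden states after the last </think> and prepend prepadding states.

    tokens:        list of token strings produced by the MLLM
    hidden_states: list of per-token vectors, aligned one-to-one with tokens
    prepad_states: list of K vectors (hypothetical layout)
    """
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must align one-to-one")
    if THINK_END not in tokens:
        raise ValueError("no </think> marker found in the MLLM output")
    # Index of the last </think>; everything before it is reasoning trace.
    cut = len(tokens) - 1 - tokens[::-1].index(THINK_END)
    instruction_states = hidden_states[cut + 1:]
    # Assumption: prepadding states come first, then the distilled instruction.
    return list(prepad_states) + instruction_states


# Toy example: five reasoning tokens (including the marker), three instruction tokens.
tokens = ["A", "cat", "needs", "fur", "</think>", "a", "fluffy", "cat"]
hidden = [[float(i)] for i in range(len(tokens))]
prepad = [[-1.0] for _ in range(NUM_PREPAD)]
condition = vgi_refine(tokens, hidden, prepad)
```

In this toy run the conditioning sequence has 25 prepadding entries followed by the three instruction states, so even a very short instruction yields a conditioning sequence of reasonable length.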
2. Training Methodology: Separable Group Relative Policy Optimization (SepGRPO)
ThinkGen’s training unfolds in a five-stage pipeline, merging supervised learning with iterative, module-alternating reinforcement learning to facilitate specialization and scalability while managing computational costs.
Supervised Stages:
- Stage 1: Connector Alignment. Only trains the linear connector mapping MLLM hidden states (dim 8192) to the DiT conditioning space (dim 2520), with both core modules frozen.
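- Stage 2: Large-scale DiT Training. DiT parameters are unfrozen and trained on 60 million mixed text-to-image, editing, and rendering examples, optimizing the Rectified Flow/Flow Matching objective, which in its standard form reads:
$$\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{t,\,x_0,\,x_1}\left[\left\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \right\rVert_2^2\right] \tag{1}$$
where $x_t = (1-t)\,x_0 + t\,x_1$, with $x_0$ a Gaussian noise sample, $x_1$ the clean image latent, $c$ the conditioning from the MLLM, and $t \in [0, 1]$.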
- Stage 3: High-Quality DiT Fine-Tuning. Training on 0.7 million high-fidelity 1024×1024px samples to refine aesthetics and instruction-following.
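The Rectified Flow/Flow Matching objective used in Stages 2 and 3 can be sketched for a single sample as follows. This is a minimal illustration under stated assumptions, not the authors' training code: the notation (noise at t=0, data at t=1, a straight-line path between them) is the common convention and may differ from the paper's, and `velocity_fn` is a hypothetical stand-in for the DiT.

```python
def rectified_flow_loss(velocity_fn, x_data, noise, t, cond=None):
    """Single-sample flow-matching loss on a straight noise-to-data path.

    velocity_fn(x_t, t, cond) -> predicted velocity (hypothetical DiT stand-in)
    x_data: clean latent as a list of floats
    noise:  Gaussian noise sample of the same length
    t:      time in [0, 1]; t=0 is pure noise, t=1 is data (assumed convention)
    """
    if len(x_data) != len(noise):
        raise ValueError("x_data and noise must have the same length")
    # Point on the straight path: x_t = (1 - t) * noise + t * data
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, x_data)]
    # The target velocity is constant along the path: data - noise
    target = [d - n for n, d in zip(noise, x_data)]
    pred = velocity_fn(x_t, t, cond)
    # Mean squared error between predicted and target velocity
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x_data)


# A predictor that always outputs zero velocity, for illustration only.
def zero_velocity(x_t, t, cond):
    return [0.0 for _ in x_t]


example_loss = rectified_flow_loss(zero_velocity, [1.0, 2.0], [0.0, 0.0], 0.5)
```

In practice t and the noise would be resampled for every example and the loss averaged over a batch; they are passed explicitly here so the computation is deterministic and easy to check.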
Reinforcement Learning Stages:
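- Stage 4: MLLM-GRPO. The DiT is frozen; the MLLM is optimized via Group Relative Policy Optimization (GRPO), maximizing downstream image quality as scored by scenario-specific rule models (e.g., HPSv3, OCR, SigLIP2). The advantage is group-normalized; in the standard GRPO form, for a group of $G$ rollouts with rewards $r_1, \dots, r_G$:
$$A_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}$$
Policy updates use a clipped surrogate objective with KL regularization (shown here at the sequence level):
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,A_i\big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right]$$
with $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$, where $q$ is the prompt and $o_i$ the $i$-th sampled output.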
- Stage 5: DiT-GRPO. The MLLM is frozen; the DiT is optimized via FlowGRPO on selected scene and text-rendering tasks, using identical sampling and reward normalization.
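The group-normalized advantage and the clipped, KL-regularized surrogate used in both RL stages can be sketched as below. This follows the generic GRPO recipe at the sequence level; the clip range `clip_eps=0.2`, the KL weight `beta=0.01`, and the per-sample KL estimates passed in as `kl_to_ref` are illustrative assumptions, not values reported for ThinkGen.

```python
import math


def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards are identical
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_objective(logp_new, logp_old, advantages, kl_to_ref,
                   clip_eps=0.2, beta=0.01):
    """Clipped surrogate with a KL penalty, averaged over the group (to be maximized)."""
    total = 0.0
    for ln, lo, adv, kl in zip(logp_new, logp_old, advantages, kl_to_ref):
        ratio = math.exp(ln - lo)  # pi_theta / pi_theta_old for this rollout
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped * adv) - beta * kl
    return total / len(advantages)


# Four rollouts for one prompt, scored by a (hypothetical) rule-based reward.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_normalized_advantages(rewards)
```

Because the advantages are centered within each group, rollouts are only rewarded for being better than their siblings, which removes the need for a separate value network.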
SepGRPO provides three benefits: reward modularity for text/vision components, decoupled learning complexity, and reduced memory consumption (since only one module is updated per step) (Jiao et al., 29 Dec 2025).
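The alternating freeze pattern behind the memory saving can be written down as a small lookup, sketched here with hypothetical module names (`mllm`, `connector`, `dit`). The text does not say whether the connector keeps training after Stage 1, so this sketch assumes it stays frozen; that choice is an assumption, not a reported detail.

```python
# Trainable modules per stage, following the five-stage description in the text.
# Assumption: the connector is trained only in Stage 1.
TRAINABLE_BY_STAGE = {
    1: {"connector"},  # connector alignment, both core modules frozen
    2: {"dit"},        # large-scale DiT training
    3: {"dit"},        # high-quality DiT fine-tuning
    4: {"mllm"},       # MLLM-GRPO, DiT frozen
    5: {"dit"},        # DiT-GRPO, MLLM frozen
}
ALL_MODULES = ("mllm", "connector", "dit")


def freeze_plan(stage):
    """Return {module_name: is_trainable} for the given stage."""
    if stage not in TRAINABLE_BY_STAGE:
        raise ValueError(f"unknown stage: {stage}")
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in ALL_MODULES}


stage4_plan = freeze_plan(4)
```

In a real training loop, each flag would typically be applied to the corresponding module's parameters so that gradients and optimizer state are kept only for the module being updated.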
3. CoT-Guided Instruction Generation
Instruction generation in ThinkGen is modeled as a two-phase process. The MLLM, prompted with a system message ([SYS]), first generates an explicit CoT sequence that reasons stepwise about the prompt and terminates with </think>. It then generates a one-sentence distilled instruction, which serves as the direct condition for image generation. The hidden states of the distilled-instruction tokens, together with the prepended Prepadding States, are extracted for consumption by the DiT.
During supervised pre-training, pseudo-CoT templates ([SYS] + [caption] + <think></think> + [caption]) are employed to avoid costly rewriting: the reasoning block is left empty and the caption itself stands in for the distilled instruction. RL phases use real-time CoT traces, allowing for more adaptive, exploratory rewrites. The use of Prepadding States (K=25) stabilizes short-instruction conditioning, markedly improving fidelity for brief prompts (Jiao et al., 29 Dec 2025).
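A minimal sketch of the pseudo-CoT construction is given below. It assumes the template consists of the system message, the caption, an empty think block, and the caption again; the newline separators, the literal `<think>` opening tag, and both function names are illustrative assumptions rather than the exact format used in training.

```python
THINK_OPEN = "<think>"    # assumed opening marker
THINK_CLOSE = "</think>"  # closing marker named in the text


def build_pseudo_cot(system_prompt, caption):
    """Assemble [SYS] + [caption] + empty think block + [caption] as one string."""
    return "\n".join([system_prompt, caption, THINK_OPEN + THINK_CLOSE, caption])


def distilled_instruction(text):
    """Return the text that follows the last </think> marker."""
    head, sep, tail = text.rpartition(THINK_CLOSE)
    if not sep:
        raise ValueError("no </think> marker found")
    return tail.strip()


sample = build_pseudo_cot("You are an image-generation assistant.", "a red bus in the rain")
instruction = distilled_instruction(sample)
```

With this template the post-</think> segment is exactly the original caption, so supervised training sees the same interface (condition on what follows the marker) that the RL stages later fill with genuinely rewritten instructions.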
4. Diffusion-Based Image Synthesis
The DiT operates as a latent diffusion transformer, jointly attending to:
- Noisy image latents.
- Textual condition embeddings from the MLLM.
- Reference-image latents (for editing tasks) via a frozen VAE (concatenated through a joint-attention mechanism).
A linear connector maps the MLLM text states to the DiT input dimension. Classifier-free guidance (cfg=4) is applied only during the initial 60% of denoising steps, accelerating convergence without sacrificing diversity. The DiT backbone is trained under the Rectified Flow objective introduced in Stage 2, and its policy optimization during RL mirrors the GRPO clipped surrogate objective. A Denoising Reduction protocol (20 steps, 512px) increases rollout efficiency during RL (Jiao et al., 29 Dec 2025).
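The partial guidance schedule can be illustrated with a toy Euler sampler. This is a sketch under several assumptions: guidance is combined in the usual form (unconditional prediction plus scale times the conditional-minus-unconditional difference), time runs from 0 (noise) to 1 (data), and `velocity_fn` is a hypothetical stand-in for the DiT that accepts `None` as the unconditional input. Only the scale of 4, the 60% cutoff, and the 20-step count come from the text.

```python
def guided_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance combination of two predictions."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample(velocity_fn, x, cond, num_steps=20, cfg_scale=4.0, cfg_fraction=0.6):
    """Euler integration with guidance applied only to the first cfg_fraction of steps."""
    dt = 1.0 / num_steps
    guided_until = int(round(cfg_fraction * num_steps))
    guided_steps = 0
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t, cond)
        if i < guided_until:
            # Early steps: one extra unconditional pass, then combine
            v_uncond = velocity_fn(x, t, None)
            v = guided_velocity(v, v_uncond, cfg_scale)
            guided_steps += 1
        # Late steps use the conditional prediction alone
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, guided_steps


# Toy model: velocity 1 when conditioned, 0 when unconditioned.
def toy_velocity(x, t, cond):
    return [1.0 if cond is not None else 0.0 for _ in x]


final_x, n_guided = sample(toy_velocity, [0.0], cond="prompt")
```

Skipping guidance in the last 40% of steps also saves the second forward pass there, which matters when many rollouts are sampled per prompt during RL.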
5. Empirical Benchmarks and Ablative Analyses
ThinkGen is validated on diverse benchmarks:
- WISEBench (1000 prompts): World-knowledge generation.
- RISEBench (360 pairs): Reasoning-driven editing.
- GenEval (553 prompts) & DPG-Bench (1065 prompts): Semantic text-to-image alignment.
- CVTG (2000 prompts): Text rendering (measured via word accuracy).
- ImgEdit (737 pairs): Editing fidelity.
Notably, enabling CoT in ThinkGen improves WISEBench from 0.55 to 0.76, RISEBench from 3.6 to 13.0, GenEval from 0.88 to 0.89, DPG from 85.14 to 85.87, CVTG from 0.80 to 0.84, and ImgEdit from 4.14 to 4.21. Across most tasks, ThinkGen equals or outperforms closed-source baselines (GPT-4o, GPT-4o Image) (Jiao et al., 29 Dec 2025).
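Because these benchmarks use different score scales, the reported with/without-CoT numbers are easier to compare as absolute and relative gains. The snippet below only reuses the figures quoted above; the dictionary layout and function name are illustrative.

```python
# (score without CoT, score with CoT), as quoted in the text
COT_RESULTS = {
    "WISEBench": (0.55, 0.76),
    "RISEBench": (3.6, 13.0),
    "GenEval": (0.88, 0.89),
    "DPG-Bench": (85.14, 85.87),
    "CVTG": (0.80, 0.84),
    "ImgEdit": (4.14, 4.21),
}


def cot_gains(results):
    """Return {benchmark: (absolute_gain, relative_gain)} for each entry."""
    gains = {}
    for name, (before, after) in results.items():
        delta = after - before
        gains[name] = (delta, delta / before)
    return gains


gains = cot_gains(COT_RESULTS)
largest_relative = max(gains, key=lambda name: gains[name][1])
for name, (delta, rel) in gains.items():
    print(f"{name:10s} +{delta:.2f} ({rel:+.1%})")
```

The relative gains are largest on the reasoning-heavy benchmarks (RISEBench and WISEBench), while the alignment, rendering, and editing-fidelity benchmarks move only slightly, consistent with the claim that CoT mainly helps where world knowledge or reasoning is required.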
Ablation studies highlight the architectural significance:
- Omitting the initial connector alignment (Stage 1) impairs text rendering (CVTG drops from 0.75 to 0.28).
- Progressive supervised and RL stages (Stages 2–5) provide systematic, quantifiable gains across semantic and rendering tasks.
- Prepadding States substantially improve short-prompt performance on GenEval, WISE, CVTG, and ImgEdit.
6. Advantages, Limitations, and Prospective Directions
The principal strength of ThinkGen lies in the explicit decoupling of reasoning and synthesis. This specialization supports flexible integration of new reasoning scenarios (via tailored RL rewards) without full-system retraining. The CoT paradigm significantly elevates world-knowledge and reasoning-edit fidelity. SepGRPO permits tractable RL by modularizing text and vision optimization.
Identified failure modes include:
- Hallucinatory reasoning in CoT, misguiding the DiT (e.g., generating nonexistent background elements).
- Instruction verbosity, which is controlled via VGI-refine.
- Dependency on rule models for reward design; suboptimal rewards may induce adversarial prompt shortcuts in the MLLM.
Future directions include end-to-end fine-tuning across both modules, richer multimodal CoT (textual and visual traces), learning dynamic CoT step pruning, and incorporating human-in-the-loop feedback to supplement handcrafted evaluative signals such as OCR or SigLIP2 (Jiao et al., 29 Dec 2025).
A plausible implication is that this modular, explicit reasoning paradigm offers a scalable template for integrating sophisticated reasoning into large-scale generative modeling.