
ThinkGen: Modular CoT Visual Generation

Updated 5 January 2026
  • ThinkGen is a modular framework for reasoning-driven visual generation that integrates explicit chain-of-thought planning with diffusion-based image synthesis.
  • It decouples multi-step reasoning in MLLMs from high-cost image generation in DiT, achieving state-of-the-art performance across text-to-image, editing, and compositional tasks.
  • The SepGRPO training paradigm alternates reinforcement learning updates between reasoning and generation modules, ensuring robust semantic alignment and fine-grained image quality.

ThinkGen is a modular framework for generalized, reasoning-intensive visual generation that leverages explicit Chain-of-Thought (CoT) planning via multimodal LLMs (MLLMs). It introduces a decoupled architecture: the MLLM provides structured, stepwise instructions derived from user intent, which are then refined and executed by a diffusion transformer (DiT) to synthesize high-quality images. The training methodology, SepGRPO, a separable variant of Group Relative Policy Optimization (GRPO), alternates reinforcement learning updates between the reasoning and generation modules. ThinkGen establishes a single unified protocol for CoT-driven visual generation across diverse tasks, including text-to-image synthesis, text rendering, semantic composition, editing, and reflection, and achieves state-of-the-art performance on standard benchmarks (Jiao et al., 29 Dec 2025).

1. Conceptual Motivation and Scope

ThinkGen addresses the limitations of task-specific prompt engineering in previous generative frameworks by integrating systematic CoT reasoning into the image synthesis pipeline. While CoT has produced significant advances in challenging understanding tasks (e.g., math, code, vision-language QA), its deployment in generation tasks was restricted by mechanisms that failed to generalize across scenarios. The ThinkGen approach decouples “thinking” (coherent, multi-step planning in MLLM space) from “drawing” (higher-cost diffusion sampling), thereby enabling reasoning-intensive, semantically faithful, and adaptable image synthesis. Design objectives include enabling a single, generalizable CoT protocol suitable for text-to-image, rendering, editing, and compositional tasks, while maintaining modularity and scalable efficiency by separating high-throughput language reasoning (token-based) from costly pixel generation.

2. System Architecture

The ThinkGen pipeline comprises three principal components: the MLLM module, Visual Generation Instruction (VGI) refinement, and the DiT module.

  • MLLM Module (Qwen3-VL-8B-Think): Accepts textual (and optionally visual) inputs with a dedicated system prompt that triggers CoT, generating a response of the form:

    [User Prompt] <think> step1, step2, ... </think> rewritten_caption_or_edit_instr

    Only the hidden states of the last two layers after the </think> marker are extracted.
  • VGI Refinement: All hidden states before the </think> marker are truncated; $K$ (empirically, 25) learnable prepadding states are concatenated with the remaining instruction states, yielding a conditioning vector of shape $(K+L) \times d$, with $d$ the transformer's hidden size (a code sketch follows the equations below).
  • DiT Module (OmniGen2-DiT-4B): Receives the instruction vector and an optional visual reference (VAE-encoded). For each diffusion timestep $t$:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The model predicts the denoising velocity field $v_\theta(x_t, t, c)$, optimized with the flow-matching objective:

$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1} \left\| (x_1 - x_0) - v_\theta(x_t, t) \right\|^2$$
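
The conditioning construction and the training objective can be made concrete in a short PyTorch sketch. Here $K = 25$ and the $(K+L) \times d$ shape come from the text, while the tensor sizes and the dit call signature are illustrative assumptions, not the released implementation:

import torch

# Illustrative sizes; d and L_instr are assumptions for the example.
K, L_instr, d = 25, 77, 2048
prepad = torch.nn.Parameter(torch.randn(K, d))   # learnable prepadding states
instr_h = torch.randn(L_instr, d)                # post-</think> instruction hidden states
c = torch.cat([prepad, instr_h], dim=0)          # (K + L) x d conditioning vector

def flow_matching_loss(dit, x0, x1, c):
    # Linear interpolation between noise x0 and data x1 (assumed convention,
    # consistent with the (x1 - x0) velocity target above).
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1
    v_pred = dit(x_t, t.flatten(), c)            # DiT velocity prediction
    return ((x1 - x0) - v_pred).pow(2).mean()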

Sampling procedure (pseudo-code):

for t = T…1:
  v̂ = DiT(x_t, t; c)
  x_{t-1} = x_t + Δt · v̂
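
A runnable PyTorch version of this loop, as a sketch: the dit signature and the step count are assumptions, and the paper's T…1 indexing is rewritten in the equivalent 0 → 1 flow-time convention:

import torch

@torch.no_grad()
def euler_sample(dit, c, shape, num_steps=50, device="cpu"):
    # Euler integration of the learned velocity field from noise to data.
    x = torch.randn(shape, device=device)        # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = dit(x, t, c)                         # predicted velocity at (x_t, t)
        x = x + dt * v                           # Euler update toward the data
    return x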

3. SepGRPO Training Paradigm

The SepGRPO protocol alternates reinforcement learning updates between the MLLM (policy $\pi_M(\cdot \mid q)$, with $q$ the query) and the DiT (policy $\pi_D(\cdot \mid c)$, with $c$ its conditioning). A rollout trajectory $\tau = (q \to o \to \text{image})$ receives a reward $r(\tau)$ from domain-specific rule models (evaluating alignment, semantics, aesthetics, etc.).

Advantage estimation:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}_j(r_j)}{\mathrm{std}_j(r_j)}$$
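
In code this is a single normalization over the $G$ rewards of one rollout group (a minimal sketch; the eps stabilizer is an assumption):

import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Group-relative advantage: center and scale rewards within one rollout group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)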

GRPO objective for textual (MLLM) rollouts:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{\sum_i |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_i,\; \mathrm{clip}\!\left( r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_i \right) \right] - \beta\, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\theta_{\text{old}}} \right)$$

with

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid o_{i,<t}, q)}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid o_{i,<t}, q)}$$
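
A minimal sketch of the clipped surrogate (the KL penalty is omitted, tensor shapes are assumptions, and averaging over tokens stands in for the $1/\sum_i |o_i|$ normalization):

import torch

def grpo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    # Clipped token-level GRPO surrogate.
    # logp_new / logp_old: (G, T) token log-probs; advantages: (G,).
    ratio = (logp_new - logp_old).exp()              # r_{i,t}(theta)
    adv = advantages.unsqueeze(-1)                   # broadcast over the token axis
    surrogate = torch.minimum(ratio * adv,
                              ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv)
    return -surrogate.mean()                         # maximize objective => minimize negative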

Training alternates as:

  • MLLM-GRPO: Freeze the DiT, sample $G$ textual rollouts per prompt, generate images with the frozen DiT, score rewards using five rule models (semantic composition, reasoning, text rendering, editing, reflection), and update the MLLM parameters.
  • DiT-GRPO (FlowGRPO): Freeze the MLLM, sample one CoT per prompt and $G$ DiT rollouts per CoT, score as above, and update the DiT parameters.
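
Schematically, the alternation could be organized as below; every name (mllm.sample, dit.generate, rule_models.score, grpo_update) is a placeholder, not the paper's API:

def sep_grpo_step(mllm, dit, prompts, rule_models, phase, G=8):
    # One SepGRPO phase: exactly one module is updated, the other stays frozen.
    for q in prompts:
        if phase == "mllm":                            # MLLM-GRPO (DiT frozen)
            cots = [mllm.sample(q) for _ in range(G)]  # G textual rollouts
            images = [dit.generate(c) for c in cots]
            rewards = [rule_models.score(q, img) for img in images]
            mllm.grpo_update(q, cots, rewards)
        else:                                          # DiT-GRPO / FlowGRPO (MLLM frozen)
            cot = mllm.sample(q)                       # one CoT per prompt
            images = [dit.generate(cot) for _ in range(G)]
            rewards = [rule_models.score(q, img) for img in images]
            dit.grpo_update(cot, images, rewards)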

4. Chain-of-Thought Reasoning Integration

CoT integration relies on explicit <think> ... </think> markers in MLLM generations. The prompt template for RL includes the system instructions and the user request, e.g.:

User: “A dog next to a car”
MLLM CoT:
<think>
  1. Identify objects “dog” and “car”.
  2. Decide spatial relation: dog on left side of car.
  3. Choose lighting: daytime, slight shadow.
</think>
Generate a brown dog standing on the left of a red sedan under daylight.

Only the concise instruction following </think> is processed by VGI-Refine for diffusion conditioning. Empirical ablations demonstrate that pruning verbose CoT traces and padding with $K$ prepadding states enhance fine-grained semantic alignment; extracting the post-</think> hidden states significantly boosts GenEval and ImgEdit scores.
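
A small text-level sketch of this extraction (hidden-state slicing in the model follows the same marker; the helper name is hypothetical):

def split_cot(generation: str, marker: str = "</think>"):
    # Keep only the concise instruction after the marker; the CoT trace is pruned.
    head, sep, tail = generation.partition(marker)
    if not sep:                  # no CoT emitted: use the full output as instruction
        return "", generation.strip()
    return head.replace("<think>", "").strip(), tail.strip()

cot, instruction = split_cot(
    "<think> 1. Identify objects... </think> "
    "Generate a brown dog standing on the left of a red sedan under daylight."
)
# `instruction` is what VGI-Refine maps to the (K + L) x d conditioning vector.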

5. Datasets, Benchmarking, and Metrics

Datasets Used for Supervised Pretraining:

  • 54M text-image pairs (T2I)
  • 5M image-editing
  • 3M text-rendering
  • 200K in-context generation
  • 0.7M high-quality tuning samples (1024×1024)

SepGRPO Scenario-Specific Data:

Scenario             | Data Size | Reward
---------------------|-----------|---------------------------
Semantic Composition | 5K        | GenEval object-alignment
Reasoning Generation | 10K       | HPSv3 CLIP-based score
Text Rendering       | 3K        | OCR word-accuracy
Image Editing        | 3K        | SigLIP2 cosine similarity
Reflection           | 3K        | Normalized Edit Distance
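
For instance, the Reflection scenario's Normalized Edit Distance can be computed as below (a generic sketch of the standard metric; how it is mapped to a reward, e.g. 1 − NED, is not specified here):

def normalized_edit_distance(a: str, b: str) -> float:
    # Levenshtein distance via dynamic programming, normalized by the longer length.
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b))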

Benchmarks for Evaluation:

  • WISEBench (1,000 reasoning gen prompts)
  • RISEBench (360 reasoning editing pairs)
  • GenEval (553 prompts, semantic alignment)
  • DPG-Bench (1,065 long-form T2I)
  • CVTG (2,000 text rendering prompts)
  • ImgEdit (737 editing pairs)

Key Metrics:

  • GenEval alignment, in $[0,1]$
  • WISE overall, in $[0,1]$
  • RISE accuracy, in $[0,100]\%$
  • Word Accuracy (OCR)
  • CLIP HPSv3, in $[0,1]$
  • SigLIP2 cosine, in $[-1,1]$
  • ImgEdit human-score (1–5)

Comparative Baselines:

  • Pure diffusion: SDXL, SD3.5, FLUX.1
  • Diffusion transformers: SD3-Medium, PixArt-α
  • Unified generation + understanding: Emu3, OmniGen2, BAGEL, STAR, BLIP3-o, Janus-Pro

6. Performance Analysis and Ablation Studies

ThinkGen demonstrates significant advances over baselines, especially where explicit CoT is leveraged.

  • Reasoning Generation (WISEBench):
    • Without CoT: 0.55
    • With CoT: 0.76 (SOTA, +21 pts)
  • Reasoning Editing (RISEBench):
    • Without CoT: 3.6
    • With CoT: 13.0 (competitive with Gemini-2.0)
  • T2I and Rendering:
    • GenEval: 0.89 (best open source)
    • DPG overall: 85.9%
    • CVTG word accuracy: 0.84 (vs. 0.80 for FLUX)
  • Editing (ImgEdit):
    • ThinkGen human-score: 4.21 (vs. AnyEdit 2.45, OmniGen2 3.44)

Ablation analyses highlight the necessity of the full pipeline and protocol:

  • Stage 1 (connector only): GenEval 0.78, CVTG 0.28; inadequate fine-grained alignment.
  • Stage 2 (+60M supervised pretraining): GenEval 0.88, CVTG 0.63.
  • Stage 3 (high-quality, high-resolution tuning): CVTG 0.75.
  • Stage 4 (MLLM-GRPO, CoT unlocked): WISE 0.76.
  • Stage 5 (DiT-GRPO): CVTG 0.84.

Further, $K=25$ prepadding states improve semantic alignment and edit quality over using none; a simple linear connector outperforms more complex designs; and truncating all pre-</think> tokens confers large boosts in GenEval and ImgEdit metrics.

7. Limitations, Adaptability, and Future Directions

Identified limitations include occasional MLLM hallucinations in CoT traces and the high cost of RL rollouts. Real-time editing at ultra-high resolutions (>4K) remains computationally intensive.

The modular design enables facile interchange of stronger MLLM or diffusion backbones. SepGRPO is reward-type agnostic; addition of new modalities (e.g. video, 3D) is supported by supplying new rule models. Proposed extensions encompass dynamic, on-the-fly CoT compression, human-in-the-loop reward shaping, and integrating end-to-end gradient flow via differentiable DiT unrolling.

This suggests that the decoupled, reward-driven protocol of ThinkGen provides a generalizable foundation for future research in reasoning-driven visual generation and related multimodal generative tasks (Jiao et al., 29 Dec 2025).
